This paper deals with order identification for nested models in
the i.i.d. framework. We study the asymptotic efficiency of two gen- 
eralized likelihood ratio tests of the order. They are based on two 
estimators which are proved to be strongly consistent. A version of 
Stein's lemma yields an optimal underestimation error exponent. The 
lemma also implies that the overestimation error exponent is neces- 
sarily trivial. Our tests admit nontrivial underestimation error ex- 
ponents. The optimal underestimation error exponent is achieved in 
some situations. The overestimation error can decay exponentially 
with respect to a positive power of the number of observations. 

These results are proved under mild assumptions by relating the 
underestimation (resp. overestimation) error to large (resp. moder- 
ate) deviations of the log-likelihood process. In particular, it is not 
necessary that the classical Cramér condition be satisfied; namely,
the log-densities are not required to admit every exponential mo- 
ment. Three benchmark examples with specific difficulties (location 
mixture of normal distributions, abrupt changes and various regres- 
sions) are detailed so as to illustrate the generality of our results. 

1. Introduction. This paper is devoted to order identification problems 
in the independent and identically distributed (i.i.d.) framework. It fits in 
the general setting of model selection initiated by the seminal papers of 
Mallows [35], Akaike [1], Rissanen [38] and Schwarz [41]. Order identification 
deals with the estimation and test of a structural parameter which indexes 
the complexity of the common distribution of the observations. The purpose 
is to derive some new consistency and efficiency results. Order identification 
applies, for instance, to mixture models [42], where the order is (loosely 
speaking) the number of populations. Another example of application is 
abrupt changes models, where the order is (roughly) the number of changes. 
It will be argued below that this example conveniently models a medical
problem in which the order is the number of distinct levels of expression of 
a disease. 

1.1. Description of the problem. We observe n i.i.d. random variables 
$Z_1, \ldots, Z_n$ with values in a measurable sample space $(\mathcal{Z}, \mathcal{F})$ ($\mathcal{Z}$ is Polish).
These observations are defined on a common measurable space upon which
all the random variables will be defined.

The distribution $P^*$ of $Z_1$ may belong to one model in the increasing
family $\{\Pi_K\}_{K \geq 1}$ of nested models. Here, each $\Pi_K$ is a parametric collection
of probability distributions which are absolutely continuous with respect to
the same measure $\mu$,

$$\Pi_K = \{P_\theta : \theta \in \Theta_K\} \subset \Pi_{K+1},$$

where $\{(\Theta_K, d_K)\}_{K \geq 1}$ is an increasing family of nested metric parameter
sets. In this paper $d_K$ will be abbreviated to $d$.

The integer $K$ is called the order of the model $\Pi_K$. It is also the order of
any $P_\theta \in \Pi_K \setminus \Pi_{K-1}$ (with the convention $\Pi_0 = \emptyset$). The order of $P^*$ is
denoted by $K^*$. It is infinite whenever $P^*$ does not belong to $\Pi_\infty = \bigcup_{K \geq 1} \Pi_K$.

The central problem of this paper is an issue of composite hypotheses
testing: we want to decide between the null hypothesis "$K^* \leq K_0$" and its
alternative "$K^* > K_0$" (for some integer $K_0$), that is, to test

"$P^* \in \Pi_{K_0}$" against "$P^* \notin \Pi_{K_0}$."

This question is obviously crucial when the order is the quantity of inter- 
est. Furthermore, order identification may also be a prerequisite to consis- 
tent parameter estimation, when overestimation of the order causes loss of 
identifiability. 

1.2. Consistency and efficiency issues. Let $\alpha_n$ and $\beta_n$ denote the type I
and type II errors of a procedure that tests the hypotheses above. This
procedure is consistent if $\alpha_n$ and $\beta_n$ converge to zero as $n$ tends to infinity.
Its efficiency is measured in terms of rates of convergence of $\alpha_n$ and $\beta_n$ to
zero.

In the classical statistical theory, a standard Neyman-Pearson procedure
tests two simple hypotheses by comparing the log-likelihoods at each of them
to a constant threshold. Now, it is known [10] that this procedure satisfies

$$\limsup_{n \to \infty} n^{-1} \log \alpha_n < 0 \quad \text{and} \quad \limsup_{n \to \infty} n^{-1} \log \beta_n < 0.$$

It is consequently natural, when investigating the efficiency of an order test- 
ing procedure, to study whether the rates of convergence are exponential 
with respect to n or not. 
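
The exponential decay can be checked numerically in the simplest case. The sketch below is an illustration added here, not taken from [10]: it assumes the two simple hypotheses N(0,1) and N(mu,1) with mu = 1, for which the likelihood ratio test with threshold 0 rejects the null when the sample mean exceeds mu/2. The function name `np_type1_rate` is hypothetical.

```python
import math

def np_type1_rate(n, mu=1.0):
    """Return (1/n) * log(alpha_n) for testing N(0,1) against N(mu,1).

    The log-likelihood ratio is mu * sum(X_i) - n * mu**2 / 2, so the test
    with threshold 0 rejects when the sample mean exceeds mu / 2 and
    alpha_n = P(N(0,1) > sqrt(n) * mu / 2).
    """
    x = math.sqrt(n) * mu / 2.0
    alpha_n = 0.5 * math.erfc(x / math.sqrt(2.0))  # Gaussian upper tail
    return math.log(alpha_n) / n

for n in (100, 500, 2000):
    print(n, np_type1_rate(n))
# The printed rates increase towards -mu**2 / 8 = -0.125 while staying below it.
```

By symmetry the type II error has the same rate in this toy case, so both errors decay exponentially fast in n.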
Two generalized likelihood ratio test procedures based on two different
estimators of $K^*$ will be studied here. Obviously, if $\widetilde{K}_n$ estimates $K^*$, then
the natural rule is to reject the null hypothesis if $\widetilde{K}_n > K_0$. Then

$$\alpha_n \leq P^*\{\widetilde{K}_n > K^*\} \quad \text{and} \quad \beta_n \leq P^*\{\widetilde{K}_n < K^*\}$$

(these upper bounds do not depend on $K_0$). According to the discussion
above, we shall thus focus on the following issues:

1. Are our order estimators strongly consistent?

2. Can $e_u > 0$ or $e_o > 0$ (the underestimation and overestimation error
exponents, resp.) be found such that

$$\limsup_{n \to \infty} n^{-1} \log P^*\{\widetilde{K}_n < K^*\} \leq -e_u$$

or

$$\limsup_{n \to \infty} n^{-1} \log P^*\{\widetilde{K}_n > K^*\} \leq -e_o\,?$$

If so, can the error exponents $e_u$ or $e_o$ be arbitrarily large? If not, what
happens at a subexponential rate, that is, when replacing the factor $n^{-1}$
by a factor $v_n^{-1} = o(1)$, with $v_n = o(n)$?

The consistency issue 1 has been studied for two decades. The interest 
in the efficiency issue 2 is more recent. By formulating the efficiency issue 
this way, we adopt the error exponent perspective of the information theory 
literature [13]. This notion of efficiency is asymptotic, as are all our results. 
It is connected to other notions of asymptotic efficiency, among which is 
Bahadur efficiency [3]. The latter is usually derived from large deviations 
results. In the following, the underestimation (resp. overestimation) error 
will similarly be related to large (resp. moderate) deviations of the log- 
likelihood process. 

1.3. Results in perspective. Pioneering results about order identification 
of time series can be found in [2]. Strong consistency of the same order 
estimator in autoregressive models is shown in [24] and [26]. The test of 
the order of an ARMA process is addressed in [16]. Error exponents for 
autoregressive order testing are investigated in [8]. 

Consistent estimation of the order of a mixture model is at stake in [15, 21, 
27, 29, 30, 34]. Efficiency issues are addressed in [15]. Also, [16] is concerned 
with the test of the order of a mixture. 

Order estimation in exponential models is studied in [25]. The rates of 
underestimation and overestimation of two estimators of the order are in- 
vestigated in [25] (for exponential models), [31] (for regular models) and 
in [23] (for models characterized by the existence of an exhaustive finite- 
dimensional statistic) . 
The problem of order identification in Markov models on a finite alphabet
must be mentioned too. Some important papers are [12, 14] (they give insight 
into the consistency issue for some classical order estimators) and also [20, 
22] (where optimal underestimation error exponents are obtained for the 
same classical order estimators). A more comprehensive presentation of order 
identification in Markov models can be found in [7]. 

A new method for new results. In most previous work the choice of the 
framework is contingent on the need for tractable explicit calculus. In this 
paper we shall resort to general properties of empirical processes. Our ap- 
proach yields several new results that hold under mild assumptions. 

In particular, our test procedures admit nontrivial underestimation error 
exponents. Besides, one of them has an optimal underestimation error expo- 
nent in some situations. Any test procedure based on a consistent estimator 
is proved to admit a necessarily trivial overestimation error exponent. The 
overestimation probabilities of our procedures can decay exponentially fast 
with respect to a positive power of n. 

More details follow. 

Benchmark examples. Let us introduce very briefly our three benchmark 
examples. Their presentation is merely sketched here, including the results 
obtained by applying our main general results. A whole section will be de- 
voted to the detailed study of the examples. 

Let $\sigma$ denote a known positive number.

• Location mixture example (LM): this is a notoriously difficult problem in
the order identification literature (see the references cited above). In this
model, one observes

$$Z_i = X_i + \sigma e_i \quad (i = 1, \ldots, n),$$

where $X_1, \ldots, X_n$ are i.i.d. hidden (i.e., not observed) random variables
with a common distribution of finite support $\{m_1, \ldots, m_{K^*}\}$, and $e_1, \ldots, e_n$
are i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian
distribution of variance 1. The goal is to estimate $K^*$.

Applying the main general results of this paper will imply the following: 

1. Our two estimators of K* are consistent. 

2. Their underestimation error exponents are nontrivial and bounded by
a number which depends on squared distances between $P^*$ and $\Pi_K$,
$K = 1, \ldots, K^* - 1$. Their overestimation error exponents are trivial but
their overestimation probabilities decay exponentially fast with respect
to a positive power of $n$.

These results are new for maximum likelihood procedures. 
• Abrupt changes example (AC): this example is original in the order
identification literature. In this model one observes

$$Y_i = f^*(X_i) + \sigma e_i \quad (i = 1, \ldots, n),$$

where $X_1, \ldots, X_n$ are i.i.d. on a subset of $\mathbb{R}^q$ ($q \geq 2$); $e_1, \ldots, e_n$ are i.i.d.
and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution of
variance 1, and the function $f^*$ is piecewise constant. Loosely speaking,
the goal is to estimate a minimal number of domains on which $f^*$ is
constant.

In virtue of the general results of this paper, the following new results 
hold ("almost surely" abbreviates to "a.s."): 

1. P*-a.s., our estimators are greater than or equal to K* eventually. 

2. Our tests admit nontrivial underestimation error exponents. Their over- 
estimation error exponents are necessarily trivial. 

• Various regression examples (VR): let $\{t_k\}_{k \geq 1}$ be an orthonormal system
in $L^2([0, 1])$. In this model one observes

$$Y_i = f^*(X_i) + \sigma e_i \quad (i = 1, \ldots, n),$$

where $X_1, \ldots, X_n$ are i.i.d., uniformly distributed on $[0, 1]$, $e_1, \ldots, e_n$ are
i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution
of variance 1, and $f^* = \sum_{k=1}^{K^*} \theta_k t_k$ with $\theta_{K^*} \neq 0$. The goal is to estimate
$K^*$.

As a consequence of the main general results of this paper, the following 
results are obtained: 

1. Our two estimators of K* are consistent. 

2. Their underestimation error exponents are nontrivial, and one of them 
achieves optimality. Their overestimation error exponents are necessar- 
ily trivial, but their overestimation probabilities decay exponentially 
fast with respect to a positive power of n. 

In particular, the optimality of one of the underestimation error 
exponents is a new result. 
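
To make the three observation schemes concrete, here is a minimal simulation sketch. It is an illustration only: the function names, the choice of vertical strips as the domains of the AC example (with q = 2), and the cosine system t_k(x) = sqrt(2) cos(k pi x) for the VR example are assumptions made here, not choices stated in the paper.

```python
import math
import random

def simulate_lm(n, means, weights, sigma, rng):
    """Location mixture: Z_i = X_i + sigma * e_i, with X_i hidden on `means`."""
    xs = rng.choices(means, weights=weights, k=n)
    return [x + sigma * rng.gauss(0.0, 1.0) for x in xs]

def simulate_ac(n, levels, sigma, rng):
    """Abrupt changes on [0,1]^2: f* is constant on equal-width vertical strips.

    `levels` lists the values of f* on the strips (hypothetical domains).
    """
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        strip = min(int(x[0] * len(levels)), len(levels) - 1)
        data.append((x, levels[strip] + sigma * rng.gauss(0.0, 1.0)))
    return data

def simulate_vr(n, theta, sigma, rng):
    """Regression: f* = sum_k theta_k t_k, with t_k(x) = sqrt(2) cos(k pi x)."""
    data = []
    for _ in range(n):
        x = rng.random()  # uniform design on [0, 1]
        f = sum(th * math.sqrt(2.0) * math.cos((k + 1) * math.pi * x)
                for k, th in enumerate(theta))
        data.append((x, f + sigma * rng.gauss(0.0, 1.0)))
    return data

rng = random.Random(0)
z = simulate_lm(5, means=[-1.0, 2.0], weights=[0.3, 0.7], sigma=0.5, rng=rng)
```

In each case the order is the length of the list that drives the signal: the number of support points in LM, the number of levels in AC (when adjacent levels differ), and the number of coefficients in VR (when the last one is nonzero).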

1.4. Organization of the paper. In Section 2 some notation precedes the 
definition of the order estimators studied here. The basic assumptions are 
stated. Moreover, two limit theorems for the log-likelihood process which 
will play a central role are recalled. The consistency results are stated and 
commented on in Section 3. The most conclusive part is Section 4. It is 
devoted to the statement of the efficiency results and comments. The ap- 
plication of our general results to the benchmark examples is addressed in 
detail in Section 5. The proofs are postponed to the Appendix. 
2. Notation and preliminaries. The integral $\int f \, d\lambda$ of a function $f$ with
respect to a measure $\lambda$ will be written as $\lambda f$. Besides, all the expressions
involving extrema and empirical processes will be assumed measurable.

2.1. Two maximum penalized likelihood estimators. Let $p_\theta$ denote the
density of $P_\theta$ with respect to $\mu$ and $\ell_\theta = \log p_\theta$ (for all $\theta \in \Theta_\infty = \bigcup_{K \geq 1} \Theta_K$).
$P^*$ is supposed to be absolutely continuous with respect to $\mu$ without loss
of generality. Its density is denoted by $p^*$ and we set $\ell^* = \log p^*$. If $P^* \in
\Pi_{K^*} \setminus \Pi_{K^*-1}$, then $P^* = P_{\theta^*}$ for $\theta^* \in \Theta_{K^*} \setminus \Theta_{K^*-1}$.

The log-likelihood $\ell_n$ of the observations is

$$\ell_n(\theta) = \sum_{i=1}^{n} \ell_\theta(Z_i) \quad (\text{every } \theta \in \Theta_\infty).$$

The penalized maximum likelihood criterion for the model $\Pi_K$ is written as

$$\mathrm{crit}(n, K) = \sup_{\theta \in \Theta_K} \ell_n(\theta) - \mathrm{pen}(n, K),$$

where pen is a positive penalty function. It yields the two estimators of the
order studied in this paper,

$$\widehat{K}_n^L = \inf\{K \geq 1 : \mathrm{crit}(n, K) \geq \mathrm{crit}(n, K + 1)\},$$

$$\widehat{K}_n^G = \inf \operatorname*{arg\,sup}_{K \geq 1} \{\mathrm{crit}(n, K)\} \geq \widehat{K}_n^L.$$

$\widehat{K}_n^G$ is a global (hence, the G in its name) maximizer of the criterion.
$\widehat{K}_n^L$ always bounds from above the first local (hence, the L) maximizer of
the same criterion. Note that the computation of these estimators is a less
demanding algorithmic task for $\widehat{K}_n^L$ than for $\widehat{K}_n^G$.
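
Once the criterion values are available, both selection rules are short to implement. The following sketch assumes the values crit(n, 1), ..., crit(n, K_max) have been computed and stored in a list; the function names and the convention of returning K_max when the criterion is strictly increasing are assumptions made for this illustration.

```python
def local_order(crit):
    """First K (1-based) with crit(n, K) >= crit(n, K + 1).

    `crit` holds crit(n, 1), ..., crit(n, K_max). Returning K_max when no
    such K exists below K_max is a truncation convention assumed here.
    """
    for k in range(len(crit) - 1):
        if crit[k] >= crit[k + 1]:
            return k + 1
    return len(crit)

def global_order(crit):
    """Smallest maximizer of K -> crit(n, K) over 1, ..., K_max."""
    return crit.index(max(crit)) + 1

crit_values = [-120.4, -101.7, -103.2, -98.9, -99.5]  # made-up numbers
print(local_order(crit_values), global_order(crit_values))  # prints "2 4"
```

The local rule stops at the first non-increase and never needs the later criterion values, which is why it is the cheaper of the two; the global rule must scan all candidate orders.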

Comment. A prior bound $K_{\max}$ for $K^*$ will be assumed known when
studying the overestimation properties of $\widehat{K}_n^G$. Indeed, we cannot control its
overestimation probability when infinitely many models are involved. This
assumption is common in the order identification literature [2, 7, 20, 21, 23,
24, 25, 30, 31].

On the one hand, there are situations where assuming the existence of
$K_{\max}$ is mandatory. It is, for instance, proven [14] that some classical
(minimum description length) order estimators are not consistent when no upper
bound to the true order is known a priori: they fail to recover the true order
of a uniformly distributed i.i.d. sequence on a finite alphabet $A$, when
$\Pi_K$ is the set of all Markov chains of order at most $K$. On the other hand,
it is also shown in the same paper that the so-called Bayesian information
criterion (BIC) order estimator is consistent when no upper bound is known
a priori. It is thus particularly interesting that the study of the properties
of $\widehat{K}_n^L$ does not require a prior bound for $K^*$.
Now, it must be emphasized that our asymptotic study of the problem
does not allow us to obtain conditions on the dependence of pen$(n, K)$ on $K$.
In contrast, the former BIC order estimator studied by Csiszár and Shields
[14] corresponds to $\mathrm{pen}(n, K) = \frac{|A|^K (|A| - 1)}{2} \log n$. It is believed that this
is a minimal penalty. In [22] the dependence on $K$ of the penalty function
is also made precise (but the penalty is certainly not minimal, according to
the authors).

The dependence of pen$(n, K)$ on $K$ could be investigated through risk
bounds for maximum log-likelihood [6, 36] (in the testing framework of this
paper, the chosen loss function is $K \mapsto 1\{K \neq K^*\}$). However, this would
require at present time some restrictive assumptions. For instance, exact
asymptotic risk bounds are yet out of reach for a mixture of Gaussian
distributions. Furthermore, exact asymptotic bounds are not enough in
overestimation, when we have to deal with infinitely many models [12].

2.2. Basic assumptions. Let us denote by $H(P|Q) = P \log(dP/dQ)$ if
$P \ll Q$, $H(P|Q) = \infty$ otherwise, the relative entropy of $P$ with respect to
$Q$. A survey of the relative entropy properties can be found, for instance, in
[19]. If $\Pi$ is a subset of $M_1(\mathcal{Z})$ [the set of all the probability measures on
$(\mathcal{Z}, \mathcal{F})$], the infimum of $H(P|Q)$ for $P$ (resp. $Q$) ranging through $\Pi$ will be
denoted by $H(\Pi|Q)$ [resp. $H(P|\Pi)$].
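
For distributions on a finite set these quantities reduce to finite sums. The sketch below is only meant to fix ideas: it assumes probability vectors given as lists and a model given as a finite list of such vectors, and the function names are hypothetical.

```python
import math

def relative_entropy(p, q):
    """H(P|Q) = sum_z p(z) log(p(z)/q(z)); infinite unless P << Q."""
    total = 0.0
    for pz, qz in zip(p, q):
        if pz > 0.0:
            if qz == 0.0:
                return math.inf  # P is not absolutely continuous w.r.t. Q
            total += pz * math.log(pz / qz)
    return total

def entropy_to_model(p, model):
    """H(P|Pi): infimum of H(P|Q) over Q in a finite family `model`."""
    return min(relative_entropy(p, q) for q in model)

print(relative_entropy([1.0, 0.0], [0.5, 0.5]))  # log 2
print(relative_entropy([0.5, 0.5], [1.0, 0.0]))  # inf
```

The two calls illustrate that the relative entropy is not symmetric, which is why $H(\Pi|Q)$ and $H(P|\Pi)$ have to be distinguished.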

The following assumptions will be needed throughout this paper: 

A1. Compactness assumption. For all $K \geq 1$, the parameter sets $(\Theta_K, d)$
are compact metric sets and the models $\Pi_K$ are compact for the weak
topology on the space $M_1(\mathcal{Z})$.

A2. Parameterization assumption. The parameterization $\theta \mapsto \ell_\theta(z)$ from $\Theta_K$
to $\mathbb{R}$ is continuous for all $z \in \mathcal{Z}$ and $K \geq 1$.

A3. Bracket assumption. There exist $l, u \in \mathbb{R}^{\mathcal{Z}}$ such that $(u - l) \in L^1(P^*)$
and

$$l \leq \ell^* \leq u \quad \text{and} \quad l \leq \ell_\theta \leq u \quad (\text{all } \theta \in \Theta_\infty).$$

A4. Penalty assumption.

pen$(n, \cdot)$ is an increasing function for all $n \geq 1$.

pen$(n, K) \to \infty$ as $n \to \infty$ and pen$(n, K) = o(n)$ for all $K \geq 1$.

The continuous parameterization assumption A2 is standard in statistics
(see, e.g., [43]). Assumption A3 is called "bracket assumption" after the
definition of the bracket $[l, u]$ (which is the set of all functions $f$ with $l \leq
f \leq u$). It is also standard in the literature to invoke A3 when empirical
processes are involved [43]. Another standard assumption in this setting is
the boundedness of the parameter set. Assumption A1 is slightly stronger (at
least when the parameter set is finite-dimensional, by virtue of the
Heine-Borel theorem, A2 and Lévy's continuity theorem). Assumption A4 is the
minimum requirement for a penalty function. Finally, it is worth noting that
A3 implies that $H(P^*|P_\theta)$ is finite for all $\theta \in \Theta_\infty$.

2.3. Large and moderate deviation of the log-likelihood process. It is shown 
in Section 4, which is devoted to efficiency issues, that underestimation can 
be related to large deviations of the log-likelihood process, while overestima- 
tion can be related to moderate deviations of the latter. Large and moderate 
deviations of the log-likelihood process both describe the limiting behavior
of the empirical measure $P_n = n^{-1} \sum_{i=1}^{n} \delta_{Z_i}$ ($\delta_z$ denotes the Dirac measure
at $z$) on rare events as $n$ goes to infinity. Let us state the principles we shall
need (their lower bounds are omitted).

Extended Sanov theorem [32]. Let $\tau$ be given by $\tau(s) = \exp(|s|) - |s| - 1$
(all $s \in \mathbb{R}$). The classes

$$\mathcal{L}_\tau(P^*) = \{f \in \mathbb{R}^{\mathcal{Z}} : \exists a > 0,\ P^* \tau(f/a) < \infty\}, \tag{1}$$

$$\mathcal{M}_\tau(P^*) = \{f \in \mathbb{R}^{\mathcal{Z}} : \forall a > 0,\ P^* \tau(f/a) < \infty\} \subset \mathcal{L}_\tau(P^*) \tag{2}$$

will play a central role in our study. $\mathcal{L}_\tau(P^*)$ [resp. $\mathcal{M}_\tau(P^*)$] is the set of
all functions on $\mathcal{Z}$ that admit some (resp. any) exponential moment with
respect to $P^*$. In the LM example, for instance, if a continuous function $f$
upon $\mathbb{R}$ satisfies $f = O(x^2)$ at infinity, then $f \in \mathcal{L}_\tau(P^*)$. For such a function,
$f \in \mathcal{M}_\tau(P^*)$ if and only if $f = o(x^2)$ at infinity. This simple example will be
particularly interesting when $f$ is a log-density $\ell_\theta$ [which is an $O(x^2)$ but
not an $o(x^2)$] or a difference $(\ell_\theta - \ell^*)$ [which is an $o(x^2)$].
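
The distinction between some and any exponential moment can be seen on a one-line computation. The following worked example is an illustration added here under a simplifying assumption: the observation $Z$ is taken to be standard Gaussian rather than a Gaussian location mixture.

```latex
% Assumption for illustration: Z ~ N(0,1) and f(x) = x^2.
\[
  E \exp\!\left(\frac{Z^2}{a}\right)
  = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}
    \exp\!\left(-x^2\Big(\frac{1}{2} - \frac{1}{a}\Big)\right) dx
  = \begin{cases}
      (1 - 2/a)^{-1/2}, & a > 2, \\
      +\infty, & 0 < a \leq 2.
    \end{cases}
\]
% Hence x^2 admits some exponential moment (take a > 2) but not every one:
% it belongs to L_tau but not to M_tau under this Gaussian law.
% In contrast, for f(x) = |x| and every a > 0,
\[
  E \exp\!\left(\frac{|Z|}{a}\right)
  \leq E e^{Z/a} + E e^{-Z/a} = 2 \exp\!\left(\frac{1}{2a^2}\right) < \infty ,
\]
% so |x| belongs to M_tau.
```

Since $\tau(s) \leq \exp(|s|)$ and $\tau(s) \geq \exp(|s|) - |s| - 1$, finiteness of $E\tau(f(Z)/a)$ and of $E\exp(|f(Z)|/a)$ are equivalent for integrable $f(Z)$, which is why the computation is carried out with the exponential alone.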
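When equipped with the norm

$$\|f\|_\tau = \inf\{a > 0 : P^* \tau(f/a) \leq 1\} \quad (\text{all } f \in \mathcal{L}_\tau), \tag{3}$$

$\mathcal{L}_\tau(P^*)$ is a Banach space. Its topological dual is denoted by $\mathcal{L}'_\tau(P^*)$. In this
paper we shall be particularly interested in the set

$$\mathcal{Q} = \{Q \in \mathcal{L}'_\tau(P^*) : Q \geq 0,\ Q1 = 1\} \cup \mathcal{P},$$

where $\mathcal{P} = \{p^{-1} \sum_{i=1}^{p} \delta_{z_i} : p \geq 1,\ z_1, \ldots, z_p \in \mathcal{Z}\}$. It is equipped with the
coarsest topology that makes the linear forms $Q \mapsto Qf$ continuous for every
$f \in \mathcal{L}_\tau(P^*)$ and with the coarsest $\sigma$-field that makes them measurable. It is
worth noting that $P_n \in \mathcal{Q} \cap \mathcal{P} = \mathcal{P}$, hence, the need for $\mathcal{P}$.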

By definition, $Q \in \mathcal{Q}$ is $P^*$-singular if there exists a sequence $\{A_p\}$ of
measurable sets such that = for all $p \geq 1$, while $\lim_{p \to \infty} P^*(A_p) = 0$.
It is known (Theorem 2.3 and Proposition 2.4 in [32]) that:

Lemma 1. Any $Q \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$ is uniquely decomposed into the sum
$Q = Q^a + Q^s$, where $Q^a \in \mathcal{L}'_\tau(P^*)$ is a probability measure, $Q^a \ll P^*$, while
$Q^s \in \mathcal{L}'_\tau(P^*)$ is $P^*$-singular and $Q^s \geq 0$. Besides, for every $f \in \mathcal{M}_\tau(P^*)$,

$$Qf = Q^a f.$$

Remark 1. $\mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$ is not a subset of $M_1(\mathcal{Z})$. If $Q \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$,
then $P(A) = Q1\{A\}$ (for any measurable set $A$) does define a probability
measure $P$, which is in fact $Q^a$. Besides, $P$ and $Q$ coincide on $\mathcal{M}_\tau(P^*)$, but
may differ on $\mathcal{L}_\tau(P^*) \setminus \mathcal{M}_\tau(P^*)$ ($Q = P = Q^a$ if and only if $Q^s = 0$).

Let us finally introduce the nonnegative function $I$ (the extended relative
entropy) defined for any $Q = Q^a + Q^s \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$ by

$$I(Q) = H(Q^a|P^*) + \sup\{Q^s f : f \in \mathcal{L}_\tau(P^*),\ P^* \exp(f) < \infty\}$$

and $I(Q) = \infty$ if $Q \in \mathcal{Q} \cap \mathcal{P} = \mathcal{P}$. It particularly satisfies the following:

Lemma 2. For every $Q \in \mathcal{Q}$, $I(Q) \geq 0$, with equality if and only if $Q =
P^*$.

Theorem 3.2 in [32] encompasses the following result. 

Theorem 1 [32]. The function $I$ is a convex, lower semicontinuous
mapping from $\mathcal{Q}$ to $[0, \infty]$. Its level sets $\{Q \in \mathcal{Q} : I(Q) \leq \alpha\}$ are compact
for all $\alpha \geq 0$. Moreover, for any measurable $S \subset \mathcal{Q}$ [with closure cl$(S)$],

$$\limsup_{n \to \infty} n^{-1} \log P^*\{P_n \in S\} \leq - \inf_{Q \in \mathrm{cl}(S)} I(Q).$$

Remark 2. Theorem 1 requires an involved setting. Three reasons mo- 
tivate its use, though: 

• A classical Sanov theorem on $M_1(\mathcal{Z})$ would be insufficient here. Indeed,
when dealing with the underestimation rate, our proofs require that the
linear forms $Q \mapsto Q\ell_\theta$ be continuous on $\mathcal{Q}$ (any $\theta \in \Theta_\infty$), while possibly
$\ell_\theta \in \mathcal{L}_\tau(P^*) \setminus \mathcal{M}_\tau(P^*)$. Now, Schied [40] has shown that the extension of
a Sanov theorem on $M_1(\mathcal{Z})$ to a topology on $M_1(\mathcal{Z})$ that makes the linear
form $Q \mapsto Qf$ continuous for some $f \in \mathcal{L}_\tau(P^*)$ is possible if and only
if $f \in \mathcal{M}_\tau(P^*)$ (this is the classical Cramér condition).

• Given the need that $Q \mapsto Qf$ be continuous on $\mathcal{Q}$ for various $f \in
\mathcal{L}_\tau(P^*) \setminus \mathcal{M}_\tau(P^*)$, the topology on $\mathcal{Q}$ introduced above is the natural
one.

• The simpler relative entropy rate function $I'(Q) = H(Q^a|P^*)$ for $Q =
Q^a + Q^s \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$, $I'(Q) = \infty$ otherwise, does not have compact level
sets (this is also a consequence of [40]). This would be a major drawback
in our scheme of proof.
Moderate deviations of $P_n$ [44]. Let $\mathcal{G}$ denote a subclass of $L^2(P^*)$ with
envelope $G \in L^2$ [i.e., $|g(z)| \leq G(z)$ for all $g \in \mathcal{G}$ and $z \in \mathcal{Z}$].

Let $\ell^\infty(\mathcal{G})$ be the collection of all bounded functions $b \in \mathbb{R}^{\mathcal{G}}$. The uniform
norm $\|\cdot\|_{\mathcal{G}}$ defined by $\|b\|_{\mathcal{G}} = \sup_{g \in \mathcal{G}} |b(g)|$ induces a topology and a $\sigma$-field
on $\ell^\infty(\mathcal{G})$.

Let us denote by $M_0(\mathcal{Z})$ the space of all signed measures $Q$ on $(\mathcal{Z}, \mathcal{F})$
that satisfy $Q1 = 0$, $\sup_{g \in \mathcal{G}} |Qg| < \infty$ and $Q \ll P^*$ (the derivative $dQ/dP^*$
is denoted by $q$). One observes that, for any $Q \in M_0(\mathcal{Z})$, $Q^\infty g = Qg$ (all
$g \in \mathcal{G}$) defines an element of $\ell^\infty(\mathcal{G})$. Particularly, $(P_n - P^*)^\infty$ is a random
variable on $\ell^\infty(\mathcal{G})$ under $P^*$.

Let us finally introduce the nonnegative function $J$ defined for any $b \in
\ell^\infty(\mathcal{G})$ by

$$J(b) = \inf\left\{\tfrac{1}{2} P^* q^2 : Q \in M_0(\mathcal{Z}),\ Q^\infty = b\right\}$$

(with the convention $\inf \emptyset = +\infty$).

Theorem 2 [44]. Let $\{v_n\}$ be an increasing sequence of positive numbers
such that $v_n = o(n)$, $n \log n = o(v_n^2)$. Let us assume that there exist $A > 1$,
$\delta \in (0, 1)$ such that, for every $k, n \geq 1$,

V n k < Ak^Vn. 

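If $\mathcal{G}$ is $P^*$-Donsker and $G \in \mathcal{L}_\tau(P^*)$, then for any $S \subset (\ell^\infty(\mathcal{G}), \|\cdot\|_{\mathcal{G}})$,

$$\limsup_{n \to \infty} (v_n^2/n)^{-1} \log P^*\{n v_n^{-1} (P_n - P^*)^\infty \in S\} \leq - \inf_{b \in \mathrm{cl}(S)} J(b).$$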

This theorem is a straightforward corollary of Theorem 5 in [44] (for a 
recent account of the P*-Donsker property, see [43]). 
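
A polynomial sequence gives a concrete feel for the speed $v_n^2/n$ in the moderate deviations bound. The example below is an illustration added here; the exponent $\gamma$ is a hypothetical parameter, and only the two rate conditions on $v_n$ are checked.

```latex
% Assumption for illustration: v_n = n^gamma with 1/2 < gamma < 1.
\[
  \frac{v_n}{n} = n^{\gamma - 1} \to 0,
  \qquad
  \frac{n \log n}{v_n^2} = n^{1 - 2\gamma} \log n \to 0,
  \qquad
  \frac{v_n^2}{n} = n^{2\gamma - 1}.
\]
% The speed n^{2 gamma - 1} is a positive power of n with exponent in (0,1).
% Moreover v_{nk} = k^gamma v_n exactly, which is the quantity constrained
% by the growth condition of the theorem.
```

With such a sequence the bound states that the probability of a deviation of order $v_n/n$ of the empirical measure is at most of order $\exp(-c\, n^{2\gamma - 1})$, which is subexponential in $n$.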

3. Consistency issue. The statements of our three results of consistency 
are gathered here. These results are rather routine. However, the resort 
to empirical process arguments allows us to achieve great generality. We 
refer to Section 5 for examples of application and comparison with previous 
consistency results in each benchmark framework. 

From now on, Log denotes the truncated log, that is, $\mathrm{Log}(x) = \log(x \vee e)$
(all $x \in \mathbb{R}$). The function $\varphi$ is defined by $\varphi(x) = x^2/\mathrm{LogLog}(x)$ (all $x \in \mathbb{R}$).
Besides, let us introduce the classes of functions

$$\mathcal{G}_K = \{g_\theta = (\ell_\theta - \ell^*) : \theta \in \Theta_K\} \quad (\text{every } K \geq 1).$$
Theorem 3. Let $P^*$ belong to $\Pi_{K^*} \setminus \Pi_{K^*-1}$. Suppose that $\varphi(u - l) \in
L^1(P^*)$ and that the penalty function satisfies

$$\liminf_{n \to \infty} \frac{\mathrm{pen}(n, K + 1)}{\mathrm{pen}(n, K)} > 1 \quad \text{and} \quad \limsup_{n \to \infty} \frac{(n \log\log n)^{1/2}}{\mathrm{pen}(n, K)} = 0 \quad (\text{any } K \geq 1).$$

• If $P^* \notin \Pi_K$ implies $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$, and if, moreover, $\mathcal{G}_{K^*+1}$
is $P^*$-Donsker, then $P^*$-a.s., $\widehat{K}_n^L = K^*$ eventually.

• If $K^* \leq K_{\max}$, and if, moreover, $\mathcal{G}_{K_{\max}}$ is $P^*$-Donsker, then $P^*$-a.s.,
$\widehat{K}_n^G = K^*$ eventually.

It is proved in Section 5 that the theorem applies to the LM and VR
examples. In the AC example, it is obtained that $P^*$-a.s., $\widehat{K}_n^L \geq K^*$ and
$\widehat{K}_n^G \geq K^*$ eventually (see Proposition B.1).

The scheme of proof of the latter theorem is rather standard. The proof of 
"no underestimation eventually" relies on the strong law of large numbers. 
It essentially requires the continuous parameterization assumption A2 and 
finally boils down to a comparison of the following: 



- $H(P^*|\Pi_K)$ with $H(P^*|\Pi_{K+1})$ for all $K \leq K^* - 1$ when dealing with $\widehat{K}_n^L$
(hence, the assumption that strict inequality holds);

- $H(P^*|\Pi_{K^*-1}) > 0$ with $H(P^*|\Pi_{K^*}) = 0$ when dealing with $\widehat{K}_n^G$ (this
comparison is obvious).

The proof of "no overestimation eventually" relies on a law of the iterated 
logarithm. It essentially requires the P*-Donsker assumptions. 
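
The role of the strict inequality for the local estimator can be seen on a toy computation. The sketch below is an illustration under assumptions that are not the paper's setting: a fixed midpoint design instead of a random uniform one, the cosine system t_k(x) = sqrt(2) cos(k pi x), noiseless responses standing in for the large-sample behavior, and a BIC-like penalty. The true function has order 3 with a vanishing second coefficient, so the relative entropy to the model does not decrease from order 1 to order 2.

```python
import math

def basis(k, x):
    """t_k(x) = sqrt(2) * cos(k * pi * x), orthonormal in L^2([0, 1]) for k >= 1."""
    return math.sqrt(2.0) * math.cos(k * math.pi * x)

def bic_penalty(n, k):
    """BIC-like penalty (k / 2) * log n (an illustrative choice)."""
    return 0.5 * k * math.log(n)

def criteria(xs, ys, sigma, k_max):
    """crit(n, K) for K = 1..k_max, up to an additive constant free of K.

    On the midpoint grid the basis vectors are exactly orthogonal, so the
    least-squares coefficients decouple and the maximized Gaussian
    log-likelihood of model K gains n * coef_k**2 / (2 * sigma**2) per term.
    """
    n = len(ys)
    crit, gain = [], 0.0
    for k in range(1, k_max + 1):
        coef = sum(y * basis(k, x) for x, y in zip(xs, ys)) / n
        gain += n * coef ** 2 / (2.0 * sigma ** 2)
        crit.append(gain - bic_penalty(n, k))
    return crit

n, sigma = 200, 1.0
xs = [(i + 0.5) / n for i in range(n)]               # fixed design (assumption)
ys = [2.0 * basis(1, x) + basis(3, x) for x in xs]   # coefficients (2, 0, 1)
crit = criteria(xs, ys, sigma, k_max=5)

k_local = next((k + 1 for k in range(len(crit) - 1) if crit[k] >= crit[k + 1]),
               len(crit))
k_global = crit.index(max(crit)) + 1
print(k_local, k_global)  # prints "1 3"
```

The local rule stops at order 1 because moving to order 2 only costs penalty, whereas the global rule recovers order 3. This is the situation ruled out by the assumption that strict inequality holds.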

We emphasize that the condition on the penalty function in Theorem 3
excludes BIC-like expressions $\mathrm{pen}(n, K) = \frac{1}{2} \dim(\Theta_K) \log n$. This can be
overcome, as shown in Theorem 4, by resorting to an example of a "peeling
device" (see Appendix A). To this end, substitutes for the $\mathcal{G}_K$ classes are
introduced, namely,

$$\mathcal{G}'_K = \left\{ g_\theta = \frac{\ell_\theta - \ell^*}{\sqrt{H(\theta)}} : \theta \in \Theta_K,\ H(\theta) > 0 \right\} \quad (\text{every } K \geq 1), \tag{5}$$

where $H(\theta) = H(P^*|P_\theta)$ for all $\theta \in \Theta_\infty$.

Theorem 4. Let $P^*$ belong to $\Pi_{K^*} \setminus \Pi_{K^*-1}$. Suppose that $\varphi(u - l) \in
L^1(P^*)$ and that the penalty function satisfies

$$\liminf_{n \to \infty} \frac{\mathrm{pen}(n, K + 1)}{\mathrm{pen}(n, K)} > 1 \quad \text{and} \quad \limsup_{n \to \infty} \frac{\log\log n}{\mathrm{pen}(n, K)} = 0 \quad (\text{any } K \geq 1).$$

• If $P^* \notin \Pi_K$ implies $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$, and if, moreover, $\mathcal{G}_{K^*+1}$
is $P^*$-Donsker, then $P^*$-a.s., $\widehat{K}_n^L = K^*$ eventually.

• If $K^* \leq K_{\max}$, and if, moreover, $\mathcal{G}'_{K_{\max}}$ is $P^*$-Donsker, then $P^*$-a.s.,
$\widehat{K}_n^G = K^*$ eventually.

This theorem applies to the LM example as proved in Section 5. 
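
The difference between the two penalty conditions is easy to see numerically for a BIC-like penalty. The sketch below is an illustration; the function names and the value `dim = 3` are assumptions. It evaluates the ratio $(n \log\log n)^{1/2} / \mathrm{pen}$ required to vanish in Theorem 3 and the ratio $\log\log n / \mathrm{pen}$ required to vanish in Theorem 4.

```python
import math

def bic_penalty(n, dim):
    """BIC-like penalty (dim / 2) * log n; `dim` plays the role of dim(Theta_K)."""
    return 0.5 * dim * math.log(n)

def theorem3_ratio(n, dim):
    """sqrt(n log log n) / pen(n, K): must tend to 0 in Theorem 3."""
    return math.sqrt(n * math.log(math.log(n))) / bic_penalty(n, dim)

def theorem4_ratio(n, dim):
    """log log n / pen(n, K): must tend to 0 in Theorem 4."""
    return math.log(math.log(n)) / bic_penalty(n, dim)

for n in (10**2, 10**4, 10**6, 10**8):
    print(n, round(theorem3_ratio(n, 3), 2), round(theorem4_ratio(n, 3), 4))
```

The first ratio grows without bound, so the BIC-like penalty violates the condition of Theorem 3, while the second ratio decreases to zero (slowly, like $\log\log n / \log n$), in line with the condition of Theorem 4.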
Finally, the last result of this section addresses a case of misspecification. 

Theorem 5. Suppose that, for every $K \geq 1$, $0 < H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$.
Then $P^* \notin \Pi_\infty$ and $P^*$-a.s.,

$$\liminf_{n \to \infty} \widehat{K}_n^L = \liminf_{n \to \infty} \widehat{K}_n^G = \infty.$$

The proofs are postponed to Appendix B. 

4. Efficiency issues. It is argued in Section 1 that the efficiency issue in
order identification problems is related to the decay to zero of $P^*\{\widetilde{K}_n < K^*\}$
and $P^*\{\widetilde{K}_n > K^*\}$ (for $\widetilde{K}_n = \widehat{K}_n^L$ or $\widehat{K}_n^G$) as $n$ tends to infinity. A comparison
with standard Neyman-Pearson tests suggested investigating whether they
can vanish exponentially fast with respect to $n$ or not. This is the question
at stake in the next section.

4.1. Best error exponents. We shall resort hereafter to a concise version 
of Stein's lemma (our Lemma 3) due to Bahadur, Zabell and Gupta [5] 
(see their Theorem 2.1, specialized here to the case of i.i.d. processes for
the sake of simplicity). An early version is mentioned in [10] in a framework of
hypotheses testing, and stated, for instance, in [4]. Lemma 3 relies on the
core of Stein's original proof (which is a change of probability argument). It
is, in most cases, the key to its various versions.

Lemma 3 [5]. Let $\mathbf{P}$, $\mathbf{Q}$ be two probability measures on the same measurable
space and $\{X_n\}$ be a sequence of random variables on it. Let $\{A_n\}$ be a
sequence of measurable sets such that $A_n$ is $\sigma(X_1, \ldots, X_n)$-measurable.

Assume that $\mathbf{P} = P^{\otimes \infty}$ and $\mathbf{Q} = Q^{\otimes \infty}$, so that $\{X_n\}$ is an i.i.d. process
under $\mathbf{P}$ and $\mathbf{Q}$.

$$\text{If } \liminf_{n \to \infty} \mathbf{Q}(A_n) > 0, \text{ then } \liminf_{n \to \infty} n^{-1} \log \mathbf{P}(A_n) \geq -H(Q|P). \tag{6}$$

Now, by virtue of Lemma 3: 

Theorem 6. Let $\widetilde{K}_n$ be any estimator of the order of the common
distribution of $Z_1, \ldots, Z_n$.

Underestimation. If for all $K_0 \geq 1$ and $P_0 \in \Pi_{K_0} \setminus \Pi_{K_0 - 1}$,

$$\limsup_{n \to \infty} P_0\{\widetilde{K}_n > K_0\} < 1, \tag{7}$$

then

$$\liminf_{n \to \infty} n^{-1} \log P^*\{\widetilde{K}_n < K^*\} \geq - \inf_{K < K^*} H(\Pi_K|P^*) = -H(\Pi_{K^*-1}|P^*).$$

Overestimation. If for all $K_0 \geq 1$ and $P_0 \in \Pi_{K_0} \setminus \Pi_{K_0 - 1}$,

$$\limsup_{n \to \infty} P_0\{\widetilde{K}_n < K_0\} < 1,$$

then

$$\liminf_{n \to \infty} n^{-1} \log P^*\{\widetilde{K}_n > K^*\} = \limsup_{n \to \infty} n^{-1} \log P^*\{\widetilde{K}_n > K^*\} = 0.$$

Proof. Set $K_0 < K^*$ and $\theta_0 \in \Theta_{K_0}$. Choose the probabilities $P = P^*$,
$Q = P_{\theta_0}$ and define $A_n = \{\widetilde{K}_n \leq K_0\}$.

The left-hand side condition of (6) is satisfied by virtue of (7), hence the
right-hand side property of (6) holds. Now, $P^*\{\widetilde{K}_n < K^*\} \geq P^*\{A_n\}$ and
$K_0, \theta_0$ are arbitrary, so the proof in the underestimation case is complete.

The proof in the overestimation case parallels the lines above. □ 

Analogous versions of this theorem have been proved in [20] and [22] in 
settings of Markov chains and hidden Markov models order identification, 
respectively. It is, however, and surprisingly a new result (to the best of 
our knowledge) in our framework of order identification from i.i.d.
observations. In summary, the underestimation (resp. overestimation) result holds
for estimators $\widetilde{K}_n$ that ultimately overestimate (resp. underestimate) the
order with a probability bounded away from one. Thus, the theorem applies
to any consistent estimator. Besides, the conclusion of Theorem 6 for such
estimators is twofold:

• The underestimation probability can decay exponentially fast with
respect to $n$, and a best possible underestimation error exponent, namely,
$H(\Pi_{K^*-1}|P^*)$, is exhibited.

• The overestimation probability cannot decay exponentially fast with re- 
spect to n: the overestimation error exponent is necessarily trivial. 
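
The best exponent can be made explicit in the VR example. The following computation is a worked illustration added here, not a statement of the paper: it writes $\theta^*_k$ for the true coefficients of $f^*$ (a notation introduced for this sketch) and assumes that $\Theta_{K^*-1}$ contains the truncation $(\theta^*_1, \ldots, \theta^*_{K^*-1})$.

```latex
% P_theta is the law of (X, Y) with X uniform on [0,1] and
% Y = f_theta(X) + sigma e, f_theta = sum_k theta_k t_k.
% For Gaussian noise with common variance and an orthonormal system,
\[
  H(P_\theta | P^*)
  = \frac{1}{2\sigma^2} \int_0^1 \big(f_\theta(x) - f^*(x)\big)^2 \, dx
  = \frac{1}{2\sigma^2} \sum_{k \geq 1} (\theta_k - \theta^*_k)^2 .
\]
% Minimizing over theta in Theta_{K^*-1} (coefficients beyond K^*-1 are zero):
\[
  H(\Pi_{K^*-1} | P^*)
  = \inf_{\theta \in \Theta_{K^*-1}} H(P_\theta | P^*)
  = \frac{(\theta^*_{K^*})^2}{2\sigma^2} ,
\]
% attained at theta_k = theta^*_k for k < K^*, under the assumption above.
```

Under these assumptions the best possible underestimation exponent grows with the squared last coefficient and shrinks with the noise variance, in agreement with the intuition that a small $\theta^*_{K^*}$ makes the order hard to detect.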

Consequently, the main issue is now to prove that $\widehat{K}_n^L$ and $\widehat{K}_n^G$ admit
nontrivial underestimation error exponents and to compare those exponents
to $H(\Pi_{K^*-1}|P^*)$. This will involve large deviations of the log-likelihood
process; see Section 4.2. The second issue is to investigate the behavior of the
overestimation probabilities. These probabilities will be related to moderate 
(instead of large) deviations of the latter process; see Section 4.3. 
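4.2. Underestimation error exponent. Let us introduce for any $\alpha \geq 0$ and
$K \geq 1$ the following subsets of $\mathcal{Q}$ ($\Lambda$ stands for Local and $\Gamma$ for Global):

$$\Lambda_{\alpha,K} = \Big\{ Q \in \mathcal{Q} : \sup_{\theta \in \Theta_K} Q\ell_\theta - \sup_{\theta \in \Theta_{K+1}} Q\ell_\theta \geq -\alpha \Big\}, \tag{8}$$

$$\Gamma_{\alpha,K} = \Big\{ Q \in \mathcal{Q} : \sup_{\theta \in \Theta_K} Q\ell_\theta - \sup_{\theta \in \Theta_{K^*}} Q\ell_\theta \geq -\alpha \Big\}. \tag{9}$$

$\{\Gamma_{\alpha,K}\}$ is nondecreasing in $\alpha$ and $K$, and $\{\Lambda_{\alpha,K}\}$ is nondecreasing in $\alpha$.
Besides, for every $\alpha \geq 0$ and $K < K^*$, $\Gamma_{\alpha,K} \subset \Lambda_{\alpha,K}$. Finally, $\Gamma_{\alpha,K^*} = \mathcal{Q}$.
From now on, let us suppose that $P^* \in \Pi_{K^*} \setminus \Pi_{K^*-1}$.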



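Theorem 7. Let us assume that, for every $\theta \in \Theta_\infty$, $H(P_\theta|P^*)$ is finite
and $p_\theta \ell_\theta \in L^1(\mu)$. Let us also suppose that:

(i) $\{\ell_\theta : \theta \in \Theta_\infty\} \subset \mathcal{L}_\tau(P^*)$.

(ii) For all $K \geq 1$, for every $Q \in \mathcal{Q}$ and $\varepsilon > 0$ small enough, there exists
a finite subset $T \subset \Theta_K$ such that

$$\forall \theta \in \Theta_K,\ \exists t \in T : |Q\ell_\theta - Q\ell_t| \leq \varepsilon.$$

• If $P^* \notin \Pi_K$ implies $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$, then

$$\limsup_{n \to \infty} n^{-1} \log P^*\{\widehat{K}_n^L < K^*\} \leq - \inf_{K < K^*} I(\Lambda_{0,K}) < 0. \tag{10}$$

• Moreover,

$$\limsup_{n \to \infty} n^{-1} \log P^*\{\widehat{K}_n^G < K^*\} \leq -I(\Gamma_{0,K^*-1}) < 0. \tag{11}$$

Furthermore, if $P^*$-a.s. for any $n \geq 1$, $Z_1, \ldots, Z_n$ are mutually distinct,
then "for every $Q \in \mathcal{Q}$" may be replaced by "for every $Q \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$" in
(ii) [yielding (ii)'].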



This theorem fully applies to the LM, AC and VR examples, as proved 
in Section 5. 



Remark 3. The alternative assumption (ii)' is needed for the AC ex- 
ample. The proof of the theorem is slightly more involved with the relaxed 
condition. It particularly requires a more precise framework for the large de- 
viations principle of Theorem 1 (refer to the proof in Section C.l for further 
details). 



TESTING THE ORDER OF A MODEL 






Comment on Theorem 7. Theorem 7 is the most conclusive result of this 
paper. It notably relates the phenomenon of underestimation to the large 
deviations of the log-likelihood process. The assumptions of the theorem are 
mild and give, we think, insight into the phenomenon of underestimation. 
This assertion is justified by the fact that the assumptions are satisfied in the 
three benchmark examples, despite their differences and specific difficulties. 
This is due to the resort to empirical processes arguments (and recent
advances in large deviations theory) in place of tractable explicit calculus. Let
us emphasize that Theorem 7 applies even when the log-densities $\ell_\theta$ admit
some exponential moment rather than any (the classical Cramér condition).

Comparison with previous results on the rate of underestimation in each 
benchmark framework can be found in Section 5. 

Besides, a comparison with [23] is relevant. In the latter, the authors consider an order estimator based on the minimization over a finite-dimensional parameter set of an empirical criterion $U_n(\theta)$. The basic assumption requires the existence of a finite-dimensional statistic $T_n$ which satisfies an exponential maximal inequality and the existence of a continuous function $U$ such that $U_n(\theta) = U(\theta, T_n)$. In this framework, tractable calculus in finite dimensions yields some nonasymptotic evaluation of the underestimation probability (and of the overestimation probability too). The scope of that paper is large, although its basic assumption excludes mixture models (and particularly LM) because there is no finite-dimensional statistic $T_n$ for the log-likelihood; it also excludes models with infinite-dimensional parameter sets (and particularly AC).

Optimal underestimation error exponent. We aim at showing that, under appropriate further assumptions, $\widehat{K}_n^{\mathrm{G}}$ achieves the optimal underestimation error exponent. Theorem 8 is an intermediate result, in which possibly tighter upper bounds for the probabilities of underestimation are stated.

Let us reinforce the structure of the spaces $\Theta_K$: now, the distance $d$ on $\Theta_K$ derives from a norm $\|\cdot\|$ on the vector space $\Theta_K$, so that the notion of differentiability with respect to $\theta \in \Theta_K$ is available. This particularly excludes the AC example.

Theorem 8. Suppose that the assumptions of Theorem 7 are valid. In 
addition, assume that: 



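(iii) $(u - l) \in \mathcal{M}^\tau(P^*)$.

(iv) For every $K < K^*$ and $z \in \mathcal{Z}$, the functions $\theta \mapsto \ell_\theta(z)$ are differen-

• If $P^* \notin \Pi_K$ implies $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$, then

$$\limsup_{n\to\infty} n^{-1} \log P^*\{\widehat{K}_n^{\mathrm{L}} < K^*\} \le -\inf_{K<K^*} H(\Lambda_{0,K} \cap M_1(\mathcal{Z})|P^*) < 0.$$

• Moreover,

$$\limsup_{n\to\infty} n^{-1} \log P^*\{\widehat{K}_n^{\mathrm{G}} < K^*\} \le -H(\Gamma_{0,K^*-1} \cap M_1(\mathcal{Z})|P^*) < 0.$$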

This theorem fully applies in the LM and VR examples, as proved in 
Section 5. 

Remark 4. In assumption (iv), inequality (12) may hold only for $h = \|h\| e_k$, where $e_k$ is the $k$th canonical basis vector of $\Theta_K$.

In conclusion, the underestimation error exponent turns out to be optimal (regarding Theorem 6) for $\widehat{K}_n^{\mathrm{G}}$ in exponential models. This is another new result. It applies particularly to our sole exponential model, that is, in the VR example, as shown in Section 5.

Theorem 9. Under the assumptions of Theorem 8 and for exponential models, the best underestimation error exponent (regarding Theorem 6) is achieved by $\widehat{K}_n^{\mathrm{G}}$.

Comment on Theorems 8 and 9. It is easily seen that the upper bounds of Theorem 8 are indeed no larger than the ones in Theorem 7, but possibly not strictly smaller. Are there situations where these inequalities are known to be strict or not strict? What is the nature of the discrepancy between the optimal exponent and the one obtained in Theorem 7? These are very difficult questions, to which we do not have any answer. Boucheron and Gassiat [8] faced the same impediment when they studied the underestimation error exponent of a procedure which tests the order of an autoregressive process. They first show that their order estimator has a nontrivial underestimation error exponent. A version of Stein's lemma yields an optimal error exponent. They finally check that, in some situations, the optimal error exponent is achieved. In both the present work and theirs, the main difficulty stems from the absence of a "full information-theoretical interpretation" (we quote their expression) of the large deviations rate function, that is, from the discrepancy between the rate function and the relative entropy.

It is also worth emphasizing that, although the optimal underestimation efficiency is proved for $\widehat{K}_n^{\mathrm{G}}$ in exponential models (see Theorem 9), we cannot conclude that $\widehat{K}_n^{\mathrm{L}}$ and $\widehat{K}_n^{\mathrm{G}}$ are not optimal on the basis of Theorems 7 and 8. In greater generality, we are not aware of any example of an order estimator proven to be suboptimal regarding the underestimation error exponent in the statistical or information-theoretical literature.

The proof of Theorem 9 (postponed to Section C.3) involves $H$-projections as defined and studied by Csiszár [11]: $\bar{Q}$ is the $H$-projection of the probability measure $Q$ on a convex set of probability measures $C$ [which must satisfy $H(P|Q) < \infty$ for some $P \in C$] if $\bar{Q} \in C$ and

$$H(\bar{Q}|Q) = H(C|Q). \tag{13}$$

$H$-projections satisfy a useful characterization (see Theorem 2.2 in [11]):

Lemma 4 [11]. $Q' \in C$ with $H(Q'|Q) < \infty$ is the $H$-projection of $Q$ on $C$ if and only if, for every $P \in C$,

$$H(P|Q) \ge H(P|Q') + H(Q'|Q).$$
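
As an illustration of Lemma 4, here is a minimal numerical sketch on a finite set. It assumes the convex set $C = \{P : P(A) = a\}$ for a fixed event $A$ and level $a$ (our choice, not taken from the paper); for this $C$ the $H$-projection of $Q$ is obtained by rescaling $Q$ on $A$ and on its complement, and the inequality of Lemma 4 holds with equality. The helper names `kl`, `h_projection` and `random_member` are hypothetical.

```python
import math
import random

def kl(p, q):
    """Relative entropy H(P|Q) = sum_i p_i log(p_i / q_i), with 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def h_projection(q, A, a):
    """H-projection of q on C = {P : P(A) = a}: rescale q on A and on its complement."""
    qA = sum(q[i] for i in A)
    return [q[i] * a / qA if i in A else q[i] * (1 - a) / (1 - qA)
            for i in range(len(q))]

def random_member(size, A, a, rng):
    """Random probability vector P on {0, ..., size-1} such that P(A) = a."""
    w = [rng.random() + 1e-3 for _ in range(size)]
    wA = sum(w[i] for i in A)
    wB = sum(w) - wA
    return [w[i] * a / wA if i in A else w[i] * (1 - a) / wB for i in range(size)]

rng = random.Random(1)
q = [0.1, 0.2, 0.3, 0.25, 0.15]      # reference measure Q (illustrative values)
A, a = {0, 1}, 0.6                   # constraint P(A) = a defining the convex set C
q_proj = h_projection(q, A, a)

# gap = H(P|Q) - H(P|Q') - H(Q'|Q); Lemma 4 requires gap >= 0 for every P in C
gaps = []
for _ in range(1000):
    p = random_member(len(q), A, a, rng)
    gaps.append(kl(p, q) - kl(p, q_proj) - kl(q_proj, q))
```

For this linear constraint every gap vanishes up to rounding, which is the equality case of the characterization; for a general convex $C$ only the inequality is guaranteed.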

Nonetheless, the proof also involves probability measures $P$, $\bar{P}$ and a set $C$ which satisfy $\bar{P} \in C$ and

$$H(P|\bar{P}) = H(P|C). \tag{14}$$

We shall say by analogy that $\bar{P}$ is the reversed-$H$-projection of $P$ on $C$. Such reversed-$H$-projections are much less tractable than $H$-projections. In general, notably, a reversed-$H$-projection cannot be characterized as in Lemma 4. However, it is remarkable that, in exponential models, reversed-$H$-projections do satisfy a similar characterization (the proof draws its inspiration from [9]):

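Lemma 5. Set $Q \in \mathcal{L}^\tau(P^*)' \cap M_1(\mathcal{Z})$ such that $H(Q|\Pi_{K^*}) < \infty$. Then $Q \ll \mu$ (let $q$ denote its density $dQ/d\mu$). Now, let us assume that:

(i) $\Pi_{K^*}$ is an exponential model.

(ii) $\{\ell_\theta : \theta \in \Theta_{K^*}\} \subset \mathcal{L}^\tau(P^*)$.

(iii) $Q \log q < \infty$.

(iv) The function $\theta \mapsto Q\ell_\theta$ is continuous from $\Theta_{K^*}$ to $\mathbb{R}$.

Let $\bar{P}$ belong to $\Pi_{K^*}$. $\bar{P}$ is the reversed-$H$-projection of $Q$ on $\Pi_{K^*}$ if and only if

$$H(Q|P_\theta) \ge H(Q|\bar{P}) + H(\bar{P}|P_\theta) \qquad (\text{any } P_\theta \in \Pi_{K^*}). \tag{15}$$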

These characterizations will play a central role in the proof of Theorem 9.

4.3. Overestimation rate. The following theorem provides a first link between the penalization function and the rate of overestimation (which is necessarily slower than exponential in $n$; see Theorem 6) that it yields for $\widehat{K}_n^{\mathrm{L}}$ and $\widehat{K}_n^{\mathrm{G}}$.

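Theorem 10. Let the penalty function be of the form $\mathrm{pen}(n,K) = v_n D(K)$, where $D \in \mathbb{R}^{\mathbb{N}}$ and $\{v_n\}$ increase, $v_n = o(n)$, and for some $A \ge 1$, $\delta \in (0,1)$, for every $k, n \ge 1$,

$$v_{nk} \le A k^{1-\delta} v_n.$$

Let us also suppose that:

(i) $(u - l) \in \mathcal{L}^\tau(P^*)$, so that the classes $\mathcal{G}_K$ [defined in (4)] admit an envelope function in $\mathcal{L}^\tau(P^*)$.

(ii) $n = o(v_n^2)$.

• If $\mathcal{G}_{K^*+1}$ is $P^*$-Donsker, then

$$\limsup_{n\to\infty} n v_n^{-2} \log P^*\{\widehat{K}_n^{\mathrm{L}} > K^*\} < 0. \tag{16}$$

• If $K^* \le K_{\max}$, and if, moreover, $\mathcal{G}_{K_{\max}}$ is $P^*$-Donsker, then

$$\limsup_{n\to\infty} n v_n^{-2} \log P^*\{\widehat{K}_n^{\mathrm{G}} > K^*\} < 0. \tag{17}$$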

For instance, $v_n = n^{1-\delta}$, $\delta \in (0, 1/2)$, is an admissible sequence, and Theorem 10 applies to the LM and VR examples.
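
The admissibility of $v_n = n^{1-\delta}$ can be checked numerically. The following sketch (with hypothetical helper names and an arbitrary grid of values of $n$ and $k$) evaluates the scaling condition $v_{nk} \le A k^{1-\delta} v_n$ of Theorem 10, which holds with $A = 1$, together with the two growth conditions $v_n = o(n)$ and $n = o(v_n^2)$; the latter fails as soon as $\delta > 1/2$.

```python
def v(n, delta):
    """Candidate penalty sequence v_n = n^(1 - delta)."""
    return n ** (1.0 - delta)

def scaling_constant(delta, ns, ks):
    """Smallest A such that v_{nk} <= A k^(1 - delta) v_n over the given grid."""
    return max(v(n * k, delta) / (k ** (1.0 - delta) * v(n, delta))
               for n in ns for k in ks)

def growth_ratios(delta, n):
    """Return (v_n / n, n / v_n^2); both should tend to 0 as n grows."""
    return v(n, delta) / n, n / v(n, delta) ** 2

ns = [10, 10**3, 10**5]
ks = [1, 2, 7, 50]
A = scaling_constant(0.25, ns, ks)        # delta = 1/4 lies in (0, 1/2)
r_small = growth_ratios(0.25, 10**4)
r_large = growth_ratios(0.25, 10**8)
r_bad = growth_ratios(0.75, 10**8)        # delta > 1/2: n / v_n^2 does not vanish
```

With $\delta = 1/4$ both ratios decrease along $n$, whereas with $\delta = 3/4$ the ratio $n / v_n^2 = n^{1/2}$ grows, so that assumption (ii) is violated.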

The resort to the same "peeling device" that allowed the transition from Theorem 3 to Theorem 4 (both devoted to the consistency issue) in Section 3 yields again a relaxed condition on $\{v_n\}$.

Theorem 11. Let pen be of the form detailed in Theorem 10. Let us also suppose that:

(i) The classes $\mathcal{G}'_K$ [defined in (5)] admit an envelope function in $\mathcal{L}^\tau(P^*)$.

(ii) $\log n = o(v_n)$.

• If $\mathcal{G}'_{K^*+1}$ is $P^*$-Donsker, then

$$\limsup_{n\to\infty} v_n^{-1} \log P^*\{\widehat{K}_n^{\mathrm{L}} > K^*\} < 0. \tag{18}$$

• If $K^* \le K_{\max}$, and if, moreover, $\mathcal{G}'_{K_{\max}}$ is $P^*$-Donsker, then

$$\limsup_{n\to\infty} v_n^{-1} \log P^*\{\widehat{K}_n^{\mathrm{G}} > K^*\} < 0. \tag{19}$$

For instance, $v_n = (\log n)^{1+\varepsilon}$ ($\varepsilon > 0$) is an admissible sequence.

Comment on Theorems 10 and 11. Theorem 10 is the main result on the efficiency issue of overestimation in this paper. It notably relates the phenomenon of overestimation to the moderate deviations of the log-likelihood process. The assumptions of the theorem are rather mild. This opinion is justified by the fact that the theorem applies to the LM and VR examples. It is worth pointing out that the conditions related to the log-densities $\ell_\theta$ are expressed in terms of the envelope function and the $P^*$-Donsker property (and not in terms of exponential moments for $\ell_\theta$). As explained in Section 5, the AC example is excluded because we do not verify the $P^*$-Donsker property.

On the contrary, Theorem 11 relies on strong assumptions, particularly assumption (i), which exclude the LM and VR examples. Although the condition on $\{v_n\}$ is relaxed, Theorem 11 does not apply to the BIC-like penalty function $\mathrm{pen}(n,K) = \frac{1}{2}\dim(\Theta_K)\log n$ ($v_n = \log n$). Besides, it is important to note that the choice of $v_n = (\log n)^{1+\varepsilon}$ yields control of the overestimation probability that decays like a negative power of $n$.
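
To make the last statement concrete, here is a short worked computation. The notation $p_n$ for the overestimation probability and the constant $c > 0$ are ours: $c$ stands for any positive number smaller than the absolute value of the limit superior in Theorem 11. For $n$ large enough,

$$p_n \le \exp(-c\, v_n) = \exp\bigl(-c (\log n)^{1+\varepsilon}\bigr) = n^{-c (\log n)^{\varepsilon}},$$

so the bound is a negative power of $n$ whose exponent $c(\log n)^{\varepsilon}$ itself grows with $n$. By contrast, the BIC-like choice $v_n = \log n$ would formally give $n^{-c}$, but it is not covered because $\log n = o(v_n)$ fails in that case.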

We refer to Section 5 for comparison with previous results on the rate 
of overestimation in the LM and VR benchmark examples (none exists for 
the AC example). The last paragraph of the comment of Theorem 7 is also 
relevant here, as a paradigm of the methods based on tractable calculus in 
finite dimensions. 

5. Benchmark examples. This section is devoted to a detailed investigation of our benchmark examples in order to illustrate the collection of results that have been stated in the two previous sections.

5.1. Location mixture example. Let $\sigma$ be a priori known and $\gamma(\cdot; m)$ denote the density of the Gaussian distribution with mean $m$ and variance $\sigma^2$ with respect to the Lebesgue measure $\mu$ on $\mathbb{R}$. Let $\mathcal{M}$ be a compact subset of $\mathbb{R}$. Here, $\Pi_1$ is the set of all Gaussian probability measures with mean $m \in \mathcal{M}$ and variance $\sigma^2$, and $\Theta_1 = \mathcal{M}$. For every $\theta \in \Theta_1$, let us define $p_\theta = \gamma(\cdot; \theta)$. Now, for any $K \ge 2$, let us introduce the compact sets

$$\Theta_K = \Bigl\{\theta = (\pi, m) : \pi = (\pi_1, \ldots, \pi_{K-1}) \in \mathbb{R}_+^{K-1}, \ \sum_{k=1}^{K-1} \pi_k \le 1, \ m \in \mathcal{M}^K\Bigr\}.$$

Every $\theta \in \Theta_K$ ($K \ge 2$) is associated with a mixing distribution $F_\theta = \sum_{k=1}^{K-1} \pi_k \delta_{m_k} + (1 - \sum_{k=1}^{K-1} \pi_k)\delta_{m_K}$ on $\mathcal{M}$ and a probability measure $P_\theta$ with density $p_\theta = \int_{\mathcal{M}} \gamma(\cdot; m)\, dF_\theta(m)$ with respect to $\mu$. For $K \ge 2$, $\Pi_K = \{P_\theta : \theta \in \Theta_K\}$. In this setting, one observes

$$Z_i = X_i + \sigma e_i \qquad (i = 1, \ldots, n),$$

where $X_1, \ldots, X_n$ are i.i.d. hidden random variables, $e_1, \ldots, e_n$ are i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution of variance 1, and there exists $\theta^* \in \Theta_{K^*} \setminus \Theta_{K^*-1}$ such that $X_1, \ldots, X_n$ have distribution $F_{\theta^*}$. In this case, $Z_1, \ldots, Z_n$ are i.i.d. and $P^*$-distributed.

Exploring the assumptions. The compactness assumption A1 is easily verified (by virtue of Lévy's continuity theorem). The continuous parameterization assumption A2 is satisfied. Defining $l = \inf \ell_\theta$ and $u = \sup \ell_\theta$ (the infimum and the supremum range over $\theta \in \Theta_\infty$) ensures $l \le \ell_\theta \le u$ (all $\theta \in \Theta_\infty$) and $(u - l)^{1+c} \in \mathcal{L}^\tau(P^*)$ for some $c > 0$. Hence, the bracket assumption A3 holds. Now, a slight adaptation of the proof of Lemma 3 in [34] yields the following:

Proposition 1. Let $F$ be a mixing distribution on $\mathcal{M}$ (possibly with infinite support) and $P^*$ have density $p^* = \int_{\mathcal{M}} \gamma(\cdot; m)\, dF(m)$. In the LM example, if $P^* \notin \Pi_K$, then $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$.

The classes $\mathcal{G}_K$ are $P^*$-Donsker (indeed, Example 19.7 in [43] guarantees that they have finite bracketing entropy integral). It can also be proven by hand that the classes $\mathcal{G}'_K$ are $P^*$-Donsker too (they have $\varepsilon$-bracketing numbers bounded by a polynomial in $\varepsilon^{-1}$, hence, finite bracketing entropy integral; see [43] for details). Consequently, the consistency conclusions of Theorems 3 and 4 are valid.

As for the efficiency issue of the underestimation rate, the assumptions of Theorems 7 and 8 are verified in this example. If $P^* \in \Pi_{K^*} \setminus \Pi_{K^*-1}$, it is clear that, for every $\theta \in \Theta_\infty$, $H(P_\theta|P^*)$ is finite, $p_\theta \ell_\theta \in L^1(\mu)$ and $\ell_\theta \in \mathcal{L}^\tau(P^*)$ [this is assumption (i) of Theorem 7]. Moreover, as proved in Section E.1 (essentially by virtue of Ascoli's theorem applied to the restrictions of the $\ell_\theta$'s to a compact set) we have the following:

Lemma 6. In the LM example, the finite sieve assumption (ii) of The- 
orem 7 is satisfied. 

Assumption (iii) of Theorem 8 holds because $(u - l)^{1+c} \in \mathcal{L}^\tau(P^*)$, hence, $(u - l) \in \mathcal{M}^\tau(P^*)$. Furthermore, it can be shown (resorting to Taylor's integral remainder formula, e.g.) that assumption (iv) of Theorem 8 also holds, so that the latter applies in the LM example.

Finally, the assumptions of Theorem 10, which deal with the efficiency 
issue of the overestimation rate, have already been verified above. 

In summary, Theorems 3, 4, 6, 7, 8 and 10 apply in the LM example. 
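
The observation scheme of the LM example, $Z_i = X_i + \sigma e_i$ with hidden $X_i$ drawn from a finitely supported mixing distribution, can be simulated as follows. This is a minimal sketch with hypothetical function names and illustrative parameter values; for convenience it takes the full vector of weights, whereas the paper parameterizes the last weight as one minus the sum of the others.

```python
import math
import random

def simulate_lm(n, weights, means, sigma, rng):
    """Draw Z_i = X_i + sigma * e_i, the hidden X_i being i.i.d. from the mixing distribution."""
    xs = rng.choices(means, weights=weights, k=n)       # hidden variables X_1, ..., X_n
    return [x + sigma * rng.gauss(0.0, 1.0) for x in xs]

def mixture_density(z, weights, means, sigma):
    """Density of the location mixture: sum_k pi_k * gamma(z; m_k)."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return sum(w * math.exp(-(z - m) ** 2 / (2.0 * sigma ** 2)) / norm
               for w, m in zip(weights, means))

rng = random.Random(0)
weights, means, sigma = [0.3, 0.7], [-1.0, 2.0], 1.0    # a mixture of order K* = 2
zs = simulate_lm(20000, weights, means, sigma, rng)
sample_mean = sum(zs) / len(zs)

# Riemann sum of the density over [-10, 12]; it should be close to 1
step = 0.01
total_mass = step * sum(mixture_density(-10.0 + i * step, weights, means, sigma)
                        for i in range(2201))
```

The sample mean should be close to the mixture mean $\sum_k \pi_k m_k$ (here $1.1$), which gives a quick sanity check of the sampler.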

Comment. Order identification in mixture models, even with known standard deviation, is a notoriously difficult problem.

Mixture models have been postulated in many applications; see Chapter 2 of [42] for a scope of these applications. Mixture models are notably characterized by their lack of identifiability when overestimating the order, and the subsequent singularity of the Fisher information matrix, which prevents one from using classical methods based on a Taylor expansion. They are also known for the tediousness of the related calculus. Besides, the log-densities $\ell_\theta$ do not belong to $\mathcal{M}^\tau(P^*)$ (the strong Cramér condition is not satisfied), hence, the need for theorems that apply to the case of $\ell_\theta \in \mathcal{L}^\tau(P^*) \setminus \mathcal{M}^\tau(P^*)$.

Let us review here the previous results of order identification in mixture models that can be found in the literature. Regarding the consistency issue, Henna [27], Dacunha-Castelle and Gassiat [15] and James, Priebe and Marchette [29] proved the consistency (without any prior bound on the true order) of three different order estimators which do not rely on a maximum likelihood procedure. The consistency of our estimator $\widehat{K}_n^{\mathrm{G}}$ (with prior bound) has already been proven in this setting in [30] and [21]. The proof in [30] involves the locally conic parameterization of a mixture model introduced in [16] (this parameterization allows one to cope with Taylor expansions). The proof of [21] relies on a clever inequality for likelihood ratios which makes the proof very simple.

As for the efficiency issue, we have made clear in Sections 4.2 and 4.3 that large or moderate deviations of the log-likelihood process are a reasonable (and certainly minimal) requirement in order to yield asymptotic bounds on the probabilities of underestimation and overestimation. Hence, the locally conic parameterization does not appear adequate to yield such bounds. Dacunha-Castelle and Gassiat [15] proved that their estimator $K_n = \arg\max_K \{U_n(K) + \mathrm{pen}(n, K)\}$ [where $U_n(K)$ depends on the data and $K$ and differs from the log-likelihood maximized on $\Theta_K$, using our notation of Section 4.3 for pen] satisfies, for some $c_1, c_2 > 0$ and $n$ large enough, $P^*\{K_n \neq K^*\} \le c_1 \exp(-c_2 n^{-1} v_n^2)$. The corresponding rate is the one of Theorem 10.

As far as we know, our results on efficiency stated in Theorems 6, 7, 8 
and 10 are new for our maximum likelihood procedures. 

5.2. Abrupt changes example. Let $(\mathcal{X}, \mathcal{B}, P)$ be an open subset of $\mathbb{R}^q$ ($q \ge 2$) equipped with the trace $\mathcal{B}$ of the Borel $\sigma$-field and a probability measure $P \ll \mu$, the Lebesgue measure on $\mathcal{X}$ (with density $dP/d\mu$ denoted by $p$).

Let $\mathrm{CP}$ be the set of all countable Caccioppoli partitions of $\mathcal{X}$. It is known that there exists a metric $d$ on $\mathrm{CP}$ such that the subset $\mathrm{CP}_b$ of all partitions whose "perimeters" are bounded by a fixed constant $b > 0$ is a compact metric space when equipped with $d$. (The definitions and main properties of Caccioppoli partitions can be found in [33].)

A partition is a family $\tau = \{\tau_j\}_{j \ge 1}$ of measurable subsets of $\mathcal{X}$ such that $P(\mathcal{X} \setminus \bigcup_j \tau_j) = 0$, $P(\tau_j \cap \tau_{j'}) = 0$ for every $j \neq j'$ and possibly $P(\tau_j) = 0$. The cardinality of $\tau$ is the number of $j \ge 1$ such that $P(\tau_j) > 0$. Given a compact set $\mathcal{M}$ of $\mathbb{R}$ and $\tau \in \mathrm{CP}_b$, it is easy to verify that one can associate $m_j \in \mathcal{M}$ with every $\tau_j$, yielding a marked partition $\{(\tau_j, m_j)\}_{j \ge 1}$, then modify the definition of $d$ so that the set of all marked partitions of $\mathrm{CP}_b$ is also a compact set when equipped with $d$. It is worth noting that, if $d[(\tau^0, m^0), (\tau^1, m^1)] \le \delta$, then there exists a bijective map $\varphi$ from $I^0 = \{j : P(\tau_j^0) > 0\}$ to $\{j : P(\tau_j^1) > 0\}$ such that $P(\tau_j^0 \,\Delta\, \tau_{\varphi(j)}^1) \le \delta$ and $|m_j^0 - m_{\varphi(j)}^1| \le \delta$ for every $j \in I^0$ ($\Delta$ denotes the symmetric difference between sets).

In this example, for every $K \ge 1$, $\Theta_K$ is the set of all marked partitions of $\mathrm{CP}_b$ with cardinality at most $K$. $(\Theta_K, d)$ is a compact metric space, hence, the first half of the compactness assumption A1. For $\sigma$ a priori known, let us denote by $\gamma(\cdot; m)$ the density of the Gaussian distribution with mean $m$ and variance $\sigma^2$, $f_\theta(x) = \sum_{k \ge 1} m_k \mathbf{1}\{x \in \tau_k\}$ and finally $p_\theta(z) = \gamma(y; f_\theta(x)) p(x)$ [for all $z = (x, y) \in \mathcal{Z} = \mathcal{X} \times \mathbb{R}$, $K \ge 1$ and $\theta \in \Theta_K$]. Let $P_\theta$ have $p_\theta$ for density with respect to $\mu$, then set $\Pi_K = \{P_\theta : \theta \in \Theta_K\}$ for every $K \ge 1$.

In this setting, one observes $Z_i = (X_i, Y_i)$ with

$$Y_i = f^*(X_i) + \sigma e_i \qquad (i = 1, \ldots, n),$$

where $X_1, \ldots, X_n$ are i.i.d. and $P$-distributed, $e_1, \ldots, e_n$ are i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution of variance 1, and there exists $\theta^* \in \Theta_{K^*} \setminus \Theta_{K^*-1}$ such that $f^* = f_{\theta^*}$. In this case, $Z_1, \ldots, Z_n$ are i.i.d. and $P_{\theta^*}$-distributed.

Exploring the assumptions. Lévy's continuity theorem implies that the second half of A1 is satisfied. Besides, the continuous parameterization assumption A2 is obviously verified. It is easily seen that the bracket assumption A3 holds. Indeed, if one introduces $\underline{f} = \inf f_\theta$ and $\bar{f} = \sup f_\theta$ (the infimum and the supremum range over $\Theta_\infty$), functions $l, u \in \mathbb{R}^{\mathcal{Z}}$ can be defined such that $(u - l)$ is continuous, $l \le \ell_\theta \le u$ (all $\theta \in \Theta_\infty$) and $2\sigma^2 (u - l)(z) = (\bar{f} + \underline{f})(x) + 2|y|(\bar{f} - \underline{f})(x)$, hence, $(u - l)^{1+c} \in \mathcal{L}^\tau(P^*)$ for some $c > 0$.

Furthermore, if the $L^2(P)$-norm is denoted by $\|\cdot\|_2$, then it is worth stressing that, for every $\theta, t \in \Theta_\infty$,

$$H(P_\theta|P_t) = \frac{\|f_\theta - f_t\|_2^2}{2\sigma^2}. \tag{20}$$

Using (20) yields (the proof is postponed to Section E.2) the following:

Lemma 7. In the AC example, if $P^* \in \Pi_\infty \setminus \Pi_K$, then $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$.

At this stage, Proposition B.1 of Appendix B applies. The proposition guarantees that underestimation eventually does not occur almost surely. On the contrary, Proposition B.2 does not apply because the required $P^*$-Donsker properties are not verified. Thus, the overestimation probability cannot be controlled.

As for the efficiency issue of the underestimation rate, the assumptions of Theorem 7 are valid in this example. If $P^* \in \Pi_{K^*} \setminus \Pi_{K^*-1}$, it is clear that, for every $\theta \in \Theta_\infty$, $H(P_\theta|P^*)$ is finite, $p_\theta \ell_\theta \in L^1(\mu)$ and $\ell_\theta \in \mathcal{L}^\tau(P^*)$ [this is assumption (i) of Theorem 7]. Finally, for all $n \ge 1$, $Z_1, \ldots, Z_n$ are mutually different $P^*$-a.s. and, as shown in Section E.3,

Lemma 8. In the AC example, the finite sieve assumption (ii)' of Theorem 7 is satisfied.

In summary, $P^*$-a.s., $\widehat{K}_n^{\mathrm{G}} \ge \widehat{K}_n^{\mathrm{L}} \ge K^*$ eventually and the part of Theorem 6 which deals with overestimation and Theorem 7 apply in the AC example.

Comment. This example is original in the order identification literature. It is related to variational image segmentation theory, although it does not entirely fit in this general framework (because we observe the random responses $Y_i$ at random points $X_i$ rather than the random responses $Y_x$ at all $x \in \mathcal{X}$). This framework of order identification is a priori difficult, notably because the parameter sets $\Theta_K$ are not finite-dimensional.

The following medical problem is conveniently modeled by the AC example. Let us suppose that a disease is characterized by distinct levels of expression $k = 1, \ldots, K^*$, whose number $K^*$ is unknown. Let us also assume that:

• The mean of a clinical measure $Y$ (modeled by a Gaussian random variable of known variance $\sigma^2$) is uniquely characterized by the level $k$ of expression of the disease.

• Simultaneously, there exist $q \ge 2$ feature (demographic, diet, clinical) measurements $(x^1, \ldots, x^q) \in \mathcal{X}$ and a segmentation $\tau^* = (\tau_k^*)_{1 \le k \le K^*}$ of the space $\mathcal{X}$ of their possible values, so that each $\tau_k^*$ corresponds uniquely to the level $k$ of the disease.

Then, if one observes both $X_i = (X_i^1, \ldots, X_i^q)$ and $Y_i$ for $i = 1, \ldots, n$ patients, one may wish to estimate the number $K^*$ of distinct levels of the disease.
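
A toy instance of the AC observation scheme can be simulated as follows. This is only a sketch under simplifying assumptions of ours: $q = 2$, $\mathcal{X}$ is the unit square with $P$ uniform, and the segmentation consists of vertical strips delimited by the hypothetical parameter `cuts`; the function names are hypothetical as well.

```python
import random

def cell_index(x, cuts):
    """Index of the vertical strip containing the point x = (x1, x2)."""
    return sum(x[0] >= c for c in cuts)

def simulate_ac(n, cuts, means, sigma, rng):
    """Draw (X_i, Y_i) with X_i uniform on the unit square and Y_i = f*(X_i) + sigma * e_i."""
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = means[cell_index(x, cuts)] + sigma * rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data

def strip_mean(data, cuts, k):
    """Empirical mean of the responses whose features fall in strip k."""
    ys = [y for (x, y) in data if cell_index(x, cuts) == k]
    return sum(ys) / len(ys)

rng = random.Random(0)
cuts, means, sigma = [1.0 / 3, 2.0 / 3], [0.0, 2.0, -1.0], 0.5   # K* = 3 levels
data = simulate_ac(6000, cuts, means, sigma, rng)
estimated = [strip_mean(data, cuts, k) for k in range(3)]
```

When the segmentation is known, the strip-wise empirical means recover the levels $m_k$; the order identification problem addresses the harder situation where neither the segmentation nor its cardinality is known.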

5.3. Various regressions example. Let $\{t_K\}_{K \ge 1}$ be a uniformly bounded system of continuous functions on $[0, 1]$. Let us also assume that it is an orthonormal system in $L^2([0,1])$ (equipped with Lebesgue measure). Let $\sigma$ be a priori known and $\gamma(\cdot; m)$ be the density of the Gaussian distribution with mean $m$ and variance $\sigma^2$. Let $\mathcal{M}$ be a compact subset of $\mathbb{R}$ that contains 0. Let us define $\Theta_K = \mathcal{M}^K$ (each $K \ge 1$). For every $\theta \in \Theta_K$, let us set $f_\theta = \sum_{k=1}^{K} \theta_k t_k$ and $p_\theta(z) = \gamma(y; f_\theta(x))$ (all $z = (x, y) \in [0,1] \times \mathbb{R}$). Let $P_\theta$ have $p_\theta$ for density with respect to Lebesgue measure on $[0, 1] \times \mathbb{R}$, then set

$$\Pi_K = \{P_\theta : \theta \in \Theta_K\}.$$

In this setting, one observes $Z_i = (X_i, Y_i)$ with

$$Y_i = f^*(X_i) + \sigma e_i \qquad (i = 1, \ldots, n),$$

where $X_1, \ldots, X_n$ are i.i.d. and uniformly distributed on $[0, 1]$, $e_1, \ldots, e_n$ are i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution of variance 1, and there exists $\theta^* \in \Theta_{K^*} \setminus \Theta_{K^*-1}$ such that $f^* = f_{\theta^*}$. In this case, $Z_1, \ldots, Z_n$ are i.i.d. and $P^*$-distributed.
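
The following sketch simulates the VR example and computes two penalized maximum likelihood order estimates. Everything specific in it is an assumption of ours rather than the paper's prescription: the cosine system $\sqrt{2}\cos(\pi k x)$ as orthonormal system, the penalty $\mathrm{pen}(n,K) = n^{1-\delta} K$ with $\delta = 0.4$, an unconstrained least squares fit (the compactness of $\mathcal{M}^K$ is ignored), and our reading of the estimators (a global one maximizing the penalized criterion over $K \le K_{\max}$, a local one stopping at the first $K$ where the criterion no longer increases), whose definitions are not restated in this section.

```python
import math
import random

def basis(x, k):
    """k-th cosine function sqrt(2) cos(pi k x); orthonormal in L2([0,1]) for k >= 1."""
    return math.sqrt(2.0) * math.cos(math.pi * k * x)

def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    coef = [0.0] * m
    for i in reversed(range(m)):
        s = M[i][m] - sum(M[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = s / M[i][i]
    return coef

def max_loglik(xs, ys, K, sigma):
    """Gaussian log-likelihood maximized over R^K, up to a constant free of K."""
    cols = [[basis(x, k) for x in xs] for k in range(1, K + 1)]
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(K)]
            for i in range(K)]
    rhs = [sum(u * y for u, y in zip(cols[i], ys)) for i in range(K)]
    coef = solve(gram, rhs)
    rss = sum((y - sum(coef[k] * cols[k][i] for k in range(K))) ** 2
              for i, y in enumerate(ys))
    return -rss / (2.0 * sigma ** 2)

def order_estimates(xs, ys, sigma, K_max, delta=0.4):
    """Return (global, local) order estimates for pen(n, K) = n^(1 - delta) * K."""
    v_n = len(ys) ** (1.0 - delta)
    crit = {K: max_loglik(xs, ys, K, sigma) - v_n * K for K in range(1, K_max + 2)}
    k_global = max(range(1, K_max + 1), key=lambda K: crit[K])
    k_local = next((K for K in range(1, K_max + 1) if crit[K] >= crit[K + 1]),
                   K_max + 1)
    return k_global, k_local

random.seed(0)
theta_star, n, sigma = [1.0, -1.0, 1.0], 2000, 1.0      # true order K* = 3
xs = [random.random() for _ in range(n)]
ys = [sum(t * basis(x, k + 1) for k, t in enumerate(theta_star))
      + sigma * random.gauss(0.0, 1.0) for x in xs]
k_global, k_local = order_estimates(xs, ys, sigma, K_max=6)
```

With these values each true component increases the maximized log-likelihood by roughly $n\theta_k^2/(2\sigma^2)$, far above the penalty increment $n^{0.6}$, while a spurious component brings a gain of order one only, so both estimates are expected to coincide with $K^* = 3$.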

Exploring the assumptions. The compactness assumption A1 is clearly satisfied (by virtue of Lévy's continuity theorem for $\Pi_\infty$). Besides, the continuous parameterization assumption A2 is readily verified. The bracket assumption A3 holds: with $\underline{f} = \inf f_\theta$ and $\bar{f} = \sup f_\theta$ (the infimum and the supremum range over $\theta \in \Theta_\infty$), $l, u \in \mathbb{R}^{\mathcal{Z}}$ can be defined such that $(u - l)$ is continuous, $l \le \ell_\theta \le u$ (any $\theta \in \Theta_\infty$) and $2\sigma^2 (u - l)(z) = (\bar{f} + \underline{f})(x) + 2|y|(\bar{f} - \underline{f})(x)$, hence, $(u - l)^{1+c} \in \mathcal{L}^\tau(P^*)$ for some $c > 0$. We emphasize that equality (20) also holds in this example when $\|\cdot\|_2$ denotes the $L^2([0, 1])$ norm. A straightforward consequence follows:

Lemma 9. In the VR example, if $P^* \in \Pi_\infty \setminus \Pi_K$, then $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$.
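
Equality (20) can be checked numerically in the VR example. The sketch below assumes the cosine system $\sqrt{2}\cos(\pi k x)$ as orthonormal system (our choice) and hypothetical function names; it integrates $p_\theta \log(p_\theta / p_t)$ over $[0,1] \times \mathbb{R}$ by quadrature and compares the result with $\|f_\theta - f_t\|_2^2 / (2\sigma^2)$, which reduces to $\sum_k (\theta_k - t_k)^2 / (2\sigma^2)$ by orthonormality.

```python
import math

def basis(x, k):
    """k-th cosine function sqrt(2) cos(pi k x); orthonormal in L2([0,1]) for k >= 1."""
    return math.sqrt(2.0) * math.cos(math.pi * k * x)

def f(theta, x):
    """Regression function f_theta(x) = sum_k theta_k t_k(x)."""
    return sum(t * basis(x, k + 1) for k, t in enumerate(theta))

def gauss(y, m, sigma):
    """Gaussian density gamma(y; m) with variance sigma^2."""
    return math.exp(-(y - m) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def kl_numeric(theta, t, sigma, nx=100, ny=1200):
    """H(P_theta | P_t): midpoint rule in x, trapezoidal rule in y over +/- 10 sigma."""
    h = 20.0 * sigma / ny
    total = 0.0
    for i in range(nx):
        x = (i + 0.5) / nx
        a, b = f(theta, x), f(t, x)
        inner = 0.0
        for j in range(ny + 1):
            y = a - 10.0 * sigma + j * h
            weight = 0.5 if j in (0, ny) else 1.0
            # log(gamma(y; a) / gamma(y; b)) in closed form, to avoid underflow
            log_ratio = ((y - b) ** 2 - (y - a) ** 2) / (2.0 * sigma ** 2)
            inner += weight * gauss(y, a, sigma) * log_ratio * h
        total += inner / nx
    return total

def kl_formula(theta, t, sigma):
    """Right-hand side of (20) for an orthonormal system."""
    return sum((u - v) ** 2 for u, v in zip(theta, t)) / (2.0 * sigma ** 2)

theta, t, sigma = [1.0, -0.5, 0.3], [0.2, 0.4, 0.0], 1.0
lhs = kl_numeric(theta, t, sigma)
rhs = kl_formula(theta, t, sigma)
```

Both parameters are given with the same number of coordinates (padding with zeros if needed), which is harmless since $0 \in \mathcal{M}$.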

Now, it can be proven that the classes $\mathcal{G}_K$ (all $K \ge 1$) defined in (4) are $P^*$-Donsker (by mimicking the proof in the LM example), hence, the consistency conclusions of Theorem 3 are valid.

As for the efficiency issue of the underestimation rate, the assumptions of Theorems 7, 8 and 9 are satisfied in this example. If $P^* \in \Pi_{K^*} \setminus \Pi_{K^*-1}$, it is clear that, for every $\theta \in \Theta_\infty$, $H(P_\theta|P^*)$ is finite, $p_\theta \ell_\theta \in L^1(\mu)$ and $\ell_\theta \in \mathcal{L}^\tau(P^*)$ [this is assumption (i) of Theorem 7]. Moreover, following the proof of Lemma 6 yields the following:

Lemma 10. In the VR example, the finite sieve assumption (ii) of The- 
orem 7 is satisfied. 

Now, it has already been argued that $(u - l)^{1+c} \in \mathcal{L}^\tau(P^*)$, hence, $(u - l) \in \mathcal{M}^\tau(P^*)$ and assumption (iii) of Theorem 8 is valid. Furthermore, a crude yet careful application of Taylor's integral remainder theorem yields assumption (iv) of Theorem 8. In conclusion, the models are exponential in the VR example, so Theorem 9 applies.

Concerning the efficiency issue of the overestimation rate, the assumptions 
of Theorem 10 have been verified in the lines above. 

In summary, Theorems 3, 6, 7, 8, 9 and 10 apply in the VR example.

Comment. We present this example because it fits in the general framework of order identification in nested exponential models. This important framework has been investigated in [25] and [31] (who actually address the more general case of regular models). In the latter, the authors study the properties of $\widehat{K}_n^{\mathrm{G}}$ (with a prior bound on the true order). They prove its weak consistency. Rates of underestimation and overestimation similar to the ones of Theorems 7 and 10 are obtained. However, the underestimation error exponent is not shown to be at most $H(\Pi_{K^*-1}|P^*)$ and, of course, is not compared to it.

Thus, to the best of our knowledge, the results of Theorems 3, 6, 7, 8 and 10 are new in this exponential model framework for $\widehat{K}_n^{\mathrm{L}}$ (which does not require any prior bound on the true order), while the results of Theorems 3, 6 and 9 are new for $\widehat{K}_n^{\mathrm{G}}$.

APPENDIX A: AN EXAMPLE OF THE PEELING DEVICE 

The so-called "peeling device" classically allows one to analyze the rate of 
convergence of M-estimators in nonclassical frameworks. The original idea 
is due to Huber [28]. Examples may be found, for instance, in [37] for simple 
proofs of uniform central limit theorems or in [6] (see Proposition 7 therein 
and the attached remark) in a framework of risk bounds model selection. 
Another form of this device is the core of [21], where it applies to an order 
estimation problem for a mixture with Markov regime. 

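Proposition A.1. Set $K_2 > K_1 \ge K^*$, the order of $P^*$. Then, both inequalities below hold, the second one providing an example of the peeling technique:

$$\sup_{\theta \in \Theta_{K_2}} |(\mathbb{P}_n - P^*)(\ell_\theta - \ell^*)| \ge \sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K_1}} \mathbb{P}_n \ell_\theta \tag{A.1}$$

and

$$\Bigl(\sup_{\theta \in \Theta_{K_2}} (\mathbb{P}_n - P^*) g_\theta\Bigr)^2 \ge \sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K_1}} \mathbb{P}_n \ell_\theta. \tag{A.2}$$

Proof. Inequality (A.1) is readily proved, since

$$\begin{aligned}
\sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K_1}} \mathbb{P}_n \ell_\theta
&\le \sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n (\ell_\theta - \ell^*) \\
&= \sup_{\theta \in \Theta_{K_2}} \{(\mathbb{P}_n - P^*)(\ell_\theta - \ell^*) + P^*(\ell_\theta - \ell^*)\} \\
&\le \sup_{\theta \in \Theta_{K_2}} (\mathbb{P}_n - P^*)(\ell_\theta - \ell^*).
\end{aligned}$$

The first inequality holds because $\theta^* \in \Theta_{K^*} \subset \Theta_{K_1}$, and the last one because $P^*(\ell_\theta - \ell^*) = -H(\theta) \le 0$.

For (A.2), let us define for all $\theta \in \Theta_{K_2}$ such that $H(\theta) > 0$ (i.e., $P^* \neq P_\theta$) the scaled log-densities ratio

$$g_\theta = \frac{\ell_\theta - \ell^*}{H(\theta)^{1/2}}$$

and $g_\theta = 0$ otherwise. Now, for any $\theta \in \Theta_{K_2}$, $H(\theta)$ nonnegative yields

$$\mathbb{P}_n(\ell_\theta - \ell^*) + H(\theta) = (\mathbb{P}_n - P^*)(\ell_\theta - \ell^*) \le H(\theta)^{1/2} \sup_{\theta' \in \Theta_{K_2}} (\mathbb{P}_n - P^*) g_{\theta'}. \tag{A.3}$$

Given $\varepsilon > 0$, let us set some $\theta_0 \in \Theta_{K_2}$ such that both $\sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n(\ell_\theta - \ell^*) \le \mathbb{P}_n(\ell_{\theta_0} - \ell^*) + \varepsilon$ and $\mathbb{P}_n(\ell_{\theta_0} - \ell^*) \ge 0$. Then, (A.3) implies, for $\theta = \theta_0$,

$$\sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n(\ell_\theta - \ell^*) \le H(\theta_0)^{1/2} \sup_{\theta \in \Theta_{K_2}} (\mathbb{P}_n - P^*) g_\theta + \varepsilon.$$

Furthermore, $\mathbb{P}_n(\ell_{\theta_0} - \ell^*) \ge 0$ combined with (A.3) imply in turn

$$H(\theta_0) \le H(\theta_0)^{1/2} \sup_{\theta \in \Theta_{K_2}} (\mathbb{P}_n - P^*) g_\theta,$$

hence,

$$\sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K_1}} \mathbb{P}_n \ell_\theta \le \sup_{\theta \in \Theta_{K_2}} \mathbb{P}_n(\ell_\theta - \ell^*) \le \Bigl(\sup_{\theta \in \Theta_{K_2}} (\mathbb{P}_n - P^*) g_\theta\Bigr)^2 + \varepsilon,$$

which completes the proof, since $\varepsilon > 0$ is arbitrary. □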

APPENDIX B: PROOFS OF CONSISTENCY 

B.1. No underestimation eventually. A strong law of large numbers for the supremum of the likelihood ratios is stated. Its routine proof relies on the achievement of $H(P^*|\Pi_K)$, the standard strong law of large numbers and the Borel-Lebesgue property.

Lemma B.1. $P^*$-a.s., for any $K \ge 1$,

$$\sup_{\theta \in \Theta_K} n^{-1}(\ell_n(\theta) - \ell_n(\theta^*)) \xrightarrow[n \to \infty]{} -H(P^*|\Pi_K).$$

Now the result of no underestimation can be stated and proved. It is seen in Section 5 that Proposition B.1 fully applies to the LM and AC examples. It is also shown that the VR example satisfies the assumption in the case of


Proposition B.1. Let us assume that $P^* \in \Pi_{K^*} \setminus \Pi_{K^*-1}$.






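• If $P^* \notin \Pi_K$ implies $H(P^*|\Pi_{K+1}) < H(P^*|\Pi_K)$, then $P^*$-a.s., $\widehat{K}_n^{\mathrm{L}} \ge K^*$ eventually.

• $P^*$-a.s., $\widehat{K}_n^{\mathrm{G}} \ge K^*$ eventually.

Proof. Let us abbreviate "infinitely often" to i.o. and prove that $P^*\{\widehat{K}_n^{\mathrm{L}} < K^* \text{ i.o.}\} = 0$ (minor changes allow us to cope with $\widehat{K}_n^{\mathrm{G}}$). By the union bound, it suffices to show that $P^*\{\widehat{K}_n^{\mathrm{L}} = K \text{ i.o.}\} = 0$ for $K = 1, \ldots, K^* - 1$. Now, if we denote $\delta = H(P^*|\Pi_{K+1}) - H(P^*|\Pi_K) < 0$,

$$\begin{aligned}
P^*\{\widehat{K}_n^{\mathrm{L}} = K \text{ i.o.}\}
&\le P^*\Bigl\{\sup_{\theta \in \Theta_K} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K+1}} \mathbb{P}_n \ell_\theta \ge \delta/2 \text{ i.o.}\Bigr\} \\
&\le P^*\Bigl\{\limsup_{n \to \infty} \Bigl(\sup_{\theta \in \Theta_K} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K+1}} \mathbb{P}_n \ell_\theta\Bigr) \ge \delta/2\Bigr\},
\end{aligned}$$

where the first inequality stems from the definition of the penalty function A4 and is satisfied for $n$ large enough. Finally, Lemma B.1 ensures that the right-hand side probability is zero (the difference of suprema converges to $\delta < \delta/2$), which concludes the proof. □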

The proof of Theorem 5 also fits in this "no underestimation" section. 

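Proof of Theorem 5. $P^* \notin \Pi_\infty$ because otherwise there would exist a $K \ge 1$ such that $H(P^*|\Pi_K) = 0$. Lemma B.1 implies that, $P^*$-a.s. and for all $K \ge 1$,

$$\sup_{\theta \in \Theta_{K+1}} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_K} \mathbb{P}_n \ell_\theta \xrightarrow[n \to \infty]{} H(P^*|\Pi_K) - H(P^*|\Pi_{K+1}) > 0.$$

Therefore, by virtue of the definition of the penalty function A4, $P^*$-a.s.,

$$\mathrm{crit}(n, K + 1) - \mathrm{crit}(n, K) \xrightarrow[n \to \infty]{} \infty,$$

hence, $\widehat{K}_n^{\mathrm{G}} \ge \widehat{K}_n^{\mathrm{L}} > K$ for $n$ large enough. This is true for any $K \ge 1$, so the proof is complete. □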

B.2. No overestimation eventually. 

Proposition B.2. Let us assume that $P^* \in \Pi_{K^*} \setminus \Pi_{K^*-1}$.

• If $\varphi(u - l) \in L^1(P^*)$, if $\mathcal{G}_{K^*+1}$ (resp. $\mathcal{G}'_{K^*+1}$) is $P^*$-Donsker, then whenever pen satisfies the condition of Theorem 3 (resp. Theorem 4), $P^*$-a.s., $\widehat{K}_n^{\mathrm{L}} \le K^*$ eventually.

• Let $K^* \le K_{\max}$. If $\varphi(u - l) \in L^1(P^*)$, if $\mathcal{G}_{K_{\max}}$ (resp. $\mathcal{G}'_{K_{\max}}$) is $P^*$-Donsker, then whenever pen satisfies the condition of Theorem 3 (resp. Theorem 4), $P^*$-a.s., $\widehat{K}_n^{\mathrm{G}} \le K^*$ eventually.






A. CHAMBAZ 



It is proven in Section 5 that the assumptions of the proposition are 
satisfied in both cases a (i.e., under the assumptions of Theorem 3) and b 
(i.e., under the assumptions of Theorem 4) in the LM example. In the VR 
example, they are satisfied in case a. 

The following lemma is a bounded law of the iterated logarithm stated 
in convenient terms for our purpose. It is a simple consequence of Theorem 
4.1. in [18]. It is involved in the proof of Proposition B.2. 

Lemma B.2 [18]. Let us assume that $\varphi(u - l) \in L^1(P^*)$ and that, for some $K \ge K^*$, $\mathcal{G} = \mathcal{G}_K$ (resp. $\mathcal{G} = \mathcal{G}'_K$) is $P^*$-Donsker. Then there exists a positive constant $C_K$ such that, $P^*$-a.s.,

$$\limsup_{n \to \infty} \frac{n^{1/2} \sup_{g \in \mathcal{G}} |(\mathbb{P}_n - P^*) g|}{(\log\log n)^{1/2}} \le C_K.$$

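Proof of Proposition B.2. Set $K = K^* + 1$. Then

$$\begin{aligned}
P^*\{\widehat{K}_n^{\mathrm{L}} > K^* \text{ i.o.}\}
&\le P^*\Bigl\{\sup_{\theta \in \Theta_K} \mathbb{P}_n \ell_\theta - \sup_{\theta \in \Theta_{K^*}} \mathbb{P}_n \ell_\theta \ge n^{-1}\{\mathrm{pen}(n, K) - \mathrm{pen}(n, K^*)\} \text{ i.o.}\Bigr\} \\
&\le P^*\Bigl\{\frac{(n \log\log n)^{1/2}}{\mathrm{pen}(n, K^*)} \cdot \frac{n^{1/2} \sup_{g \in \mathcal{G}_K} |(\mathbb{P}_n - P^*) g|}{(\log\log n)^{1/2}} \ge \frac{\mathrm{pen}(n, K)}{\mathrm{pen}(n, K^*)} - 1 \text{ i.o.}\Bigr\},
\end{aligned}$$

where the last inequality is straightforward [it is (A.1)]. Consequently, whenever $\varphi(u - l) \in L^1(P^*)$ and $\mathcal{G}_K$ is $P^*$-Donsker, Lemma B.2 applies and implies that, if pen satisfies the condition of Theorem 3, then $P^*\{\widehat{K}_n^{\mathrm{L}} > K^* \text{ i.o.}\} = 0$.

Now, renormalization yields an alternative bound for the second probability in the display above [by using (A.2) of the peeling technique Proposition A.1], namely

$$P^*\{\widehat{K}_n^{\mathrm{L}} > K^* \text{ i.o.}\} \le P^*\Bigl\{\frac{\log\log n}{\mathrm{pen}(n, K^*)} \Bigl(\frac{n^{1/2} \sup_{g \in \mathcal{G}'_K} |(\mathbb{P}_n - P^*) g|}{(\log\log n)^{1/2}}\Bigr)^2 \ge \frac{\mathrm{pen}(n, K)}{\mathrm{pen}(n, K^*)} - 1 \text{ i.o.}\Bigr\}.$$

Therefore, if $\varphi(u - l) \in L^1(P^*)$ and $\mathcal{G}'_K$ is $P^*$-Donsker, Lemma B.2 applies and implies that, as soon as pen satisfies the condition of Theorem 4, $P^*\{\widehat{K}_n^{\mathrm{L}} > K^* \text{ i.o.}\} = 0$. This concludes the study of $\widehat{K}_n^{\mathrm{L}}$.

Furthermore, if $K^* < K_{\max}$, then the union bound guarantees that it suffices to prove that $P^*\{\widehat{K}_n^{\mathrm{G}} = K \text{ i.o.}\} = 0$ for $K = K^* + 1, \ldots, K_{\max}$ in order to conclude the study of $\widehat{K}_n^{\mathrm{G}}$. Minor changes in the previous lines yield the result. □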

APPENDIX C: PROOFS OF EFFICIENCY: UNDERESTIMATION 

C.1. Proof of Theorem 7. Theorem 7 is first proven under assumption (ii). The modification of the proof under assumption (ii)' is sketched at the end of this subsection. Let us begin with some useful lemmas.

Lemma C.1. Under the assumptions of Theorem 7, the sets $\Lambda_{\alpha,K}$ and $\Gamma_{\alpha,K}$ are measurable and closed in $\mathcal{Q}$ for every $\alpha \ge 0$ and $K < K^*$.

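Proof. The measurability issue is obvious. Set $\alpha \ge 0$ and $K < K^*$. We shall actually prove that $\Lambda_{\alpha,K}^c$ is an open set (the same proof applies to $\Gamma_{\alpha,K}$, up to minor changes). To this end, let us point out that the topology on $\mathcal{Q}$ is generated by the collection of open sets

$$O(f, x, \varepsilon) = \{Q \in \mathcal{Q} : |Qf - x| < \varepsilon\} \qquad (\text{any } f \in \mathcal{L}^\tau(P^*),\ x \in \mathbb{R},\ \varepsilon > 0).$$

Choose $Q_0 \in \Lambda_{\alpha,K}^c$, $\alpha', \varepsilon > 0$ such that $\alpha' - 6\varepsilon > \alpha$ and $\sup_{\theta \in \Theta_K} Q_0 \ell_\theta - \sup_{\theta \in \Theta_{K+1}} Q_0 \ell_\theta < -\alpha'$. Let us denote by $T_K$ (resp. $T_{K+1}$) the finite sieve subset of $\Theta_K$ (resp. $\Theta_{K+1}$) for $Q = Q_0$, $\varepsilon$ and $K$ (resp. $K + 1$) in assumption (ii). Let us then define the open neighborhood $V$ of $Q_0$ by

$$V = \bigcap_{t \in T_K} \{Q \in \mathcal{Q} : |Q\ell_t - Q_0\ell_t| < \varepsilon\} \cap \bigcap_{t \in T_{K+1}} \{Q \in \mathcal{Q} : |Q\ell_t - Q_0\ell_t| < \varepsilon\}.$$

Straightforwardly, whenever $Q \in V$,

$$\sup_{\theta \in \Theta_K} Q\ell_\theta \le \sup_{\theta \in \Theta_K} Q_0\ell_\theta + 3\varepsilon$$

and

$$\sup_{\theta \in \Theta_{K+1}} Q_0\ell_\theta \le \sup_{\theta \in \Theta_{K+1}} Q\ell_\theta + 3\varepsilon,$$

hence, $Q \in \Lambda_{\alpha,K}^c$. So $V$ is an open neighborhood of $Q_0$ included in $\Lambda_{\alpha,K}^c$. This completes the proof of the lemma since $Q_0$ was arbitrarily chosen in $\Lambda_{\alpha,K}^c$. □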

Lemma C.2. Under the assumptions of Theorem 7, for every K < K*, 

c a ,k n r ,K and p* £ a ,^ u r , K - 

Proof. Let $\tau^*$ be the convex conjugate of $\tau$, given by $\tau^*(t) = (1+|t|)\log(1+|t|) - |t|$ (all $t \in \mathbb{R}$). One can substitute $\tau^*$ for $\tau$ in the definitions (1) of $\mathcal{L}_\tau(P^*)$ and (3) of $\|\cdot\|_\tau$, yielding a Banach space $(\mathcal{L}_{\tau^*}(P^*), \|\cdot\|_{\tau^*})$. Now, according to (2.2) in [32], $P^*|fg| \le 2\|f\|_\tau\|g\|_{\tau^*}$ [all $f \in \mathcal{L}_\tau(P^*)$, $g \in \mathcal{L}_{\tau^*}(P^*)$], so that $\mathcal{L}_{\tau^*}(P^*)$ can be identified with a subspace of $\mathcal{L}'_\tau(P^*)$.

Furthermore, it is readily seen that the density of any $P_\theta \in \Pi_{K^*-1}$ with respect to $P^*$ belongs to $\mathcal{L}_{\tau^*}(P^*)$; hence, $\Pi_{K^*-1} \subset \mathcal{Q}$.
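The closed form of $\tau^*$ quoted in this proof can be checked numerically. The sketch below assumes that $\tau(x) = e^{|x|} - |x| - 1$, a definition that is not restated in this appendix and is taken here as a hypothesis; the helper names are illustrative. It compares a brute-force supremum of $tx - \tau(x)$ with the stated formula and checks Young's inequality $xt \le \tau(x) + \tau^*(t)$.

```python
import math


def tau(x: float) -> float:
    # ASSUMPTION: tau(x) = exp(|x|) - |x| - 1 (not restated in this appendix).
    return math.exp(abs(x)) - abs(x) - 1.0


def tau_star(t: float) -> float:
    # Closed form quoted in the proof of Lemma C.2.
    return (1.0 + abs(t)) * math.log1p(abs(t)) - abs(t)


def conjugate_on_grid(t: float, x_max: float = 10.0, steps: int = 40001) -> float:
    # Brute-force sup over x in [-x_max, x_max] of t*x - tau(x) on a uniform grid.
    best = -math.inf
    for i in range(steps):
        x = -x_max + 2.0 * x_max * i / (steps - 1)
        best = max(best, t * x - tau(x))
    return best


def young_gap(x: float, t: float) -> float:
    # Nonnegative whenever Young's inequality x*t <= tau(x) + tau_star(t) holds.
    return tau(x) + tau_star(t) - x * t


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 3.0, -2.0):
        print(t, conjugate_on_grid(t), tau_star(t))
```

For $t \ge 0$ the supremum is attained at $x = \log(1+t)$, which is why a bounded grid suffices for moderate values of $t$.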

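Let us choose $K < K^*$ and $P_{\theta_0} \in \Pi_K$. Since $p_{\theta_0}\log p_{\theta_0} \in L^1(\mu)$, $\sup_{\theta\in\Theta_{K'}} P_{\theta_0}\ell_\theta = -\inf_{\theta\in\Theta_{K'}} H(P_{\theta_0}|P_\theta) + P_{\theta_0}\ell_{\theta_0} = P_{\theta_0}\ell_{\theta_0}$ when $K' \ge K$. Straightforwardly, $\Pi_K \subset A_{0,K} \cap \Gamma_{0,K}$.

Besides, $P^* \in A_{0,K}$ would yield

$$\sup_{\theta\in\Theta_K} P^*\ell_\theta - \sup_{\theta\in\Theta_{K+1}} P^*\ell_\theta = -H(P^*|\Pi_K) + H(P^*|\Pi_{K+1}) = 0$$

and $P^* \in \Gamma_{0,K}$ would yield, in turn,

$$0 \le \sup_{\theta\in\Theta_K} P^*\ell_\theta - \sup_{\theta\in\Theta_{K^*}} P^*\ell_\theta = -H(P^*|\Pi_K),$$

where the right-hand side term is negative because $H(P^*|\cdot)$ achieves its infimum on the compact set $\Pi_K$ and $P^* \notin \Pi_K$. This completes the proof of the lemma. □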

The proof of Theorem 7 follows. 

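Because $P^*\{\widehat{K}_n^L < K^*\} = \sum_{K<K^*} P^*\{\widehat{K}_n^L = K\}$, Lemma 1.2.15 of [17] ensures that

$$\limsup_{n\to\infty} n^{-1}\log P^*\{\widehat{K}_n^L < K^*\} = \sup_{K<K^*} \limsup_{n\to\infty} n^{-1}\log P^*\{\widehat{K}_n^L = K\}.$$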
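The reduction above is an instance of the principle of the largest term: on the logarithmic scale, a finite sum of exponentially small probabilities behaves like its largest summand. The sketch below illustrates this on hypothetical probabilities of the form $e^{-n r_K}$; the rates are invented for the illustration and are not the exponents of Theorem 7.

```python
import math


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) for a finite list of reals.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normalized_log_sum(rates, n):
    # n^{-1} * log( sum_K exp(-n * r_K) ) for hypothetical decay rates r_K.
    return log_sum_exp([-n * r for r in rates]) / n


if __name__ == "__main__":
    rates = [0.3, 0.7, 1.2]  # hypothetical exponents, one per candidate order K
    for n in (10, 100, 1000, 5000):
        print(n, normalized_log_sum(rates, n), -min(rates))
```

The normalized logarithm lies between $-\min_K r_K$ and $-\min_K r_K + n^{-1}\log(\text{number of terms})$, so only the smallest rate survives in the limit.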

Thus, it suffices to choose $K < K^*$ and show that

$$\limsup_{n\to\infty} n^{-1}\log P^*\{\widehat{K}_n^L = K\} \le -I(A_{0,K}) < 0 \tag{C.1}$$

in order to get (10). Now, for any $\alpha > 0$,

$$\limsup_{n\to\infty} n^{-1}\log P^*\{\widehat{K}_n^L = K\} \le \limsup_{n\to\infty} n^{-1}\log P^*\{\mathbb{P}_n \in A_{\alpha,K}\} \le -I(A_{\alpha,K})$$

by virtue of Theorem 1 and Lemma C.1 [$\mathrm{cl}(A_{\alpha,K}) = A_{\alpha,K}$].

Furthermore, $\{I(A_{\alpha,K})\}$ nondecreases as $\alpha \downarrow 0$ and it is bounded by $H(\Pi_K|P^*)$ by virtue of Lemma C.2. Let us denote $L = \lim_{\alpha\downarrow 0} I(A_{\alpha,K}) \le I(A_{0,K}) \le H(\Pi_K|P^*)$.

Since $I$ is lower semicontinuous with compact level sets, it achieves its infimum on the closed sets $A_{\alpha,K}$: let $Q_p \in A_{1/p,K}$ be such that $I(Q_p) = I(A_{1/p,K})$ for every $p \ge 1$. For any $q \ge 1$, the set $\mathrm{cl}(\{Q_p : p \ge q\})$ is compact [it is closed in the compact set $A_{1/q,K} \cap \{Q \in \mathcal{Q} : I(Q) \le L\}$]. By virtue of the Borel-Lebesgue property, the intersection of the nonincreasing sequence of nonvoid compact sets $\{\mathrm{cl}(\{Q_p : p \ge q\})\}_{q\ge 1}$ is nonvoid too, so $\bar{Q}$ can be chosen in the intersection.

Now, it is readily seen that both $\bar{Q} \in A_{0,K}$ and $I(\bar{Q}) = I(A_{0,K}) = L$. Finally, Lemmas 2 and C.2 guarantee that $I(\bar{Q}) > 0$ and yield (C.1), hence (10).

The proof of (11) for $\widehat{K}_n^G$ is almost identical and is omitted.

Proof under assumption (ii)'. Let us assume that $P^*$-a.s., for all $n \ge 1$, $Z_1, \ldots, Z_n$ are mutually distinct. Then $P^*$-a.s., $\mathbb{P}_n \in \mathcal{V}'$, where $\mathcal{V}'$ is the subset of $\mathcal{V}$ obtained by adding the condition that $z_1, \ldots, z_p$ must be mutually distinct in the definition of $\mathcal{V}$. Besides, since $I$ is infinite on $\mathcal{V}$, one can substitute $\mathcal{V}'$ for $\mathcal{V}$ in the definition of $\mathcal{Q}$ (see Lemma 4.1.5 in [17]).

The framework introduced for the large deviations principle was intentionally somewhat too simple (for the sake of legibility). Under assumption (ii), this is just a matter of convention. When dealing with assumption (ii)', we must be more careful.

Now, rigorously, $Q \in \mathcal{V}$ is a linear form on $L_\tau(P^*)$, which has the same definition as $\mathcal{L}_\tau(P^*)$ except that $P^*$-almost everywhere equal functions are not identified. The topology on $\mathcal{Q}$ is the coarsest one that makes the linear forms $Q \mapsto Qf$ continuous for all $f \in L_\tau(P^*)$. This change has no effect on $\mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$. It nevertheless allows us to prove that each $Q \in \mathcal{V}'$ is its own open neighborhood in $\mathcal{Q}$.

Indeed, choose $Q_0 = p^{-1}\sum_{i=1}^p \delta_{z_i} \in \mathcal{V}'$. Let $u > 1$ be such that $u/(u-1) < (p+1)/p$ and $V = \bigcap_{i=1}^p \{Q \in \mathcal{Q} : |Q\mathbf{1}\{z_i\} - 1/p| < (up)^{-1}\}$. Then

• $Q_0 \in V$ and $V$ is open.

• If $Q \in V$, then $Q \in \mathcal{V}'$ (otherwise, $Q\mathbf{1}\{z_i\} = 0$).

• If $Q = m^{-1}\sum_{i=1}^m \delta_{\zeta_i}$, then $\{\zeta_1, \ldots, \zeta_m\} \supset \{z_1, \ldots, z_p\}$; hence, particularly, $m \ge p$.

• Finally, $Q \in V$ yields $|1/m - 1/p| < (up)^{-1}$, which implies, in turn, $m < p+1$; hence, $m = p$ and $Q = Q_0$.

This property allows us to adapt straightforwardly the proof of Lemma C.1 under assumption (ii)', proving thus the last statement of Theorem 7. □
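The last step of the neighborhood argument is elementary arithmetic and can be verified exactly. The sketch below uses the hypothetical choice $u = p + 2$, which satisfies $u/(u-1) < (p+1)/p$, and lists every $m \ge p$ (up to an arbitrary cap) for which $|1/m - 1/p| < (up)^{-1}$; the function names are illustrative only.

```python
from fractions import Fraction


def u_is_admissible(p: int, u: Fraction) -> bool:
    # Condition imposed on u in the proof: u > 1 and u/(u-1) < (p+1)/p.
    return u > 1 and u / (u - 1) < Fraction(p + 1, p)


def admissible_sizes(p: int, u: Fraction, m_max: int = 500):
    # All m in [p, m_max] such that |1/m - 1/p| < 1/(u*p), in exact arithmetic.
    bound = 1 / (u * p)
    return [m for m in range(p, m_max + 1)
            if abs(Fraction(1, m) - Fraction(1, p)) < bound]


if __name__ == "__main__":
    for p in (1, 2, 5, 10):
        u = Fraction(p + 2)  # hypothetical choice; any u > p + 1 works
        print(p, u_is_admissible(p, u), admissible_sizes(p, u))
```

Only $m = p$ survives, which is the claim that an empirical-type measure in $V$ must have exactly $p$ atoms.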

C.2. Proof of Theorem 8. Let us first state some preliminary lemmas. 

Lemma C.3. Under assumptions (i) of Theorem 7 and (iii) of Theorem 8, if $Q \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$, then the function $\theta \mapsto Q\ell_\theta$ mapping $\Theta_{K^*}$ to $\mathbb{R}$ is continuous over $\Theta_{K^*}$.

Lemma C.4. Let us choose $Q \in \mathcal{Q}$ and $K \le K^*$. Under assumptions (i) of Theorem 7 and (iv) of Theorem 8, the function $\theta \mapsto Q\ell_\theta$ mapping $\Theta_K$ to $\mathbb{R}$ is differentiable on the interior of $\Theta_K$, with derivative $\theta \mapsto Q\dot{\ell}_\theta$.

In order to show Lemma C.3, it is sufficient to prove that $\|\ell_{\theta_p} - \ell_{\theta_0}\|_\tau \to 0$ when $\theta_p \to \theta_0$ in $\Theta_{K^*}$ (dominated convergence theorem). Lemma C.4 simply relies on the positivity of $Q \in \mathcal{Q}$. Combining both lemmas yields the following:

Lemma C.5. Let $Q \in \mathcal{L}'_\tau(P^*)$ be $P^*$-singular (i.e., $Q = Q^s$). Then, under the assumptions of Theorem 8, $\theta \mapsto Q\ell_\theta$ is constant over $\Theta_{K^*}$.

Consequently, by applying Lemma C.5 to $Q = \bar{Q}^s$ (see the end of the proof of Theorem 7 in Section C.1), $\bar{Q} \in A_{0,K}$ yields $\bar{Q}^a \in A_{0,K} \cap M_1(\mathcal{Z})$. Therefore,

$$H(A_{0,K} \cap M_1(\mathcal{Z})|P^*) \le H(\bar{Q}^a|P^*) = I(\bar{Q}^a) \le I(\bar{Q}) = I(A_{0,K}) \le I(A_{0,K} \cap M_1(\mathcal{Z})) = H(A_{0,K} \cap M_1(\mathcal{Z})|P^*),$$

which concludes the proof of Theorem 8 for $\widehat{K}_n^L$. The study of $\widehat{K}_n^G$ goes along the same lines, up to minor changes. □

C.3. Proof of Theorem 9. The proof relies heavily on Lemma 5, which 
is shown at the end of this section. If one resumes the proof of Theorem 8, 
it is clear that the following proposition straightforwardly yields the result 
of Theorem 9: 

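Proposition C.1. If $\bar{Q} \in \mathcal{L}'_\tau(P^*) \cap M_1(\mathcal{Z})$ satisfies

$$H(\bar{Q}|P^*) = H(\Gamma_{0,K^*-1} \cap M_1(\mathcal{Z})|P^*) < \infty,$$

then, under the assumptions of Theorem 9, $\bar{Q} \in \Pi_{K^*-1}$ and $H(\bar{Q}|P^*) = H(\Pi_{K^*-1}|P^*)$.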

Remark C.1. A simple modification of the proof below implies that, under the assumptions of Theorem 9 and for $K = K^*-1$,

$$H(A_{0,K} \cap M_1(\mathcal{Z})|P^*) = H(\Pi_K|P^*).$$

The proof cannot be adapted anymore when K < K* — 1 (the unadaptable 
argument is pointed out). 

Proof of Proposition C.1. Let us set $K = K^*-1$ and $\bar{Q}$ as described in Proposition C.1. Let us suppose that the assumptions of Theorem 9 are valid. The hard part is to show that $\bar{Q} \in \Pi_K$ since Lemma C.2 guarantees that $\Pi_K \subset \Gamma_{0,K} \cap M_1(\mathcal{Z})$.

Because $\Pi_K$ is compact and $H(\bar{Q}|\cdot)$ is lower semicontinuous, there exists $P \in \Pi_K$ (whose density $dP/d\mu$ is denoted by $p$) such that $H(\bar{Q}|P) = H(\bar{Q}|\Pi_K)$. According to the definition (14), $P$ is the reversed $H$-projection of $\bar{Q}$ on $\Pi_K$.

We prove hereafter that $\bar{Q} = P \in \Pi_K$, which is the expected result. Let us introduce the subset $C$ of $M_1(\mathcal{Z}) \cap \mathcal{L}'_\tau(P^*)$ defined by

$$C = \{Q : H(Q|P^*) < \infty\} \cap \Big\{Q : Q\log p = \sup_{\theta\in\Theta_{K+1}} Q\ell_\theta\Big\} \cap \{Q : Q \ll \mu,\ dQ/d\mu = q,\ Q\log q < \infty\}.$$

The following properties hold (their simple proofs are omitted):

• $C$ is convex, $\bar{Q} \in C$ and $C \subset \Gamma_{0,K}$.

• $H(\bar{Q}|P^*) = H(C|P^*)$.

• For every $Q \in C$, $H(Q|P) = H(Q|\Pi_K) = H(Q|\Pi_{K^*})$.

Accordingly,

• $\bar{Q}$ is the $H$-projection of $P^*$ on $C$.

• $P$ is the reversed $H$-projection on $\Pi_{K^*}$ of every $Q \in C$.

Since the assumptions of Lemma 5 are satisfied, for every $Q \in C$,

$$H(Q|P^*) \ge H(Q|P) + H(P|P^*)$$

[just choose $P^*$ in (15); we point out that this argument is not adaptable when dealing with $A_{0,K}$ for $K < K^*-1$]. Consequently, the characterization of Lemma 4 guarantees that, necessarily, $P$ is the $H$-projection of $P^*$ on $C$, that is, $\bar{Q} = P$; hence, $\bar{Q} \in \Pi_K$. This completes the proof of Proposition C.1. □

Proof of Lemma 5. Since $H(Q|\Pi_{K^*}) < \infty$, there exists $P_\theta$ such that $H(Q|P_\theta) < \infty$; hence, $Q \ll P_\theta$ and $Q \ll \mu$.

Obviously, if (15) holds, then $H(Q|P) = H(Q|\Pi_{K^*})$, and $H(Q|P_\theta) = H(Q|P)$ yields $H(P|P_\theta) = 0$; hence, $P = P_\theta$.

Conversely, the exponential nature of the model is needed:

$$p_\theta(z) = h(z)\exp[\theta^T t(z) - \phi(\theta)] \quad (\text{all } z \in \mathcal{Z}),$$

where $t = (t_1, \ldots, t_{d_{K^*}})$ is a known function on $\mathcal{Z} \subset \mathbb{R}^q$ equipped with Lebesgue measure $\mu$ on Borel sets, $h \ge 0$ is measurable, and $\Theta_{K^*}$ is a convex subset of the convex and open natural parameter space $\Theta = \{\theta \in \mathbb{R}^{d_{K^*}} : \mu(h\exp(\theta^T t)) < \infty\}$; $\phi(\theta) = \log\mu(h\exp(\theta^T t))$ (all $\theta \in \Theta$). Let us emphasize that $\phi$ is convex and differentiable on $\Theta$, with $\dot{\phi}(\theta) = P_\theta t$.
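The two properties of the log-partition function recalled above (convexity, and gradient equal to the mean of the sufficient statistic) can be illustrated numerically. The sketch below is a one-dimensional toy case chosen for this illustration only: $h$ is the standard normal density and $t(z) = z$, so that $\phi(\theta) = \theta^2/2$ and $P_\theta t = \theta$. The grid, the step sizes and the function names are assumptions of the example, not objects from the paper.

```python
import math


def h(z: float) -> float:
    # Hypothetical carrier density: standard normal.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def t(z: float) -> float:
    # Hypothetical one-dimensional sufficient statistic.
    return z


def make_grid(lo: float = -12.0, hi: float = 12.0, n: int = 2401):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def log_partition(theta: float, grid) -> float:
    # phi(theta) = log integral of h(z) exp(theta t(z)) dz, by a Riemann sum.
    dz = grid[1] - grid[0]
    return math.log(sum(h(z) * math.exp(theta * t(z)) for z in grid) * dz)


def mean_of_t(theta: float, grid) -> float:
    # P_theta t = integral of t(z) p_theta(z) dz, by the same Riemann sum.
    dz = grid[1] - grid[0]
    phi = log_partition(theta, grid)
    return sum(t(z) * h(z) * math.exp(theta * t(z) - phi) for z in grid) * dz


def central_difference(theta: float, grid, step: float = 1e-4) -> float:
    # Finite-difference approximation of the derivative of phi at theta.
    return (log_partition(theta + step, grid)
            - log_partition(theta - step, grid)) / (2.0 * step)


if __name__ == "__main__":
    grid = make_grid()
    print(central_difference(0.7, grid), mean_of_t(0.7, grid))
```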

Let $Q$ be chosen as described in the lemma. Let $P$ be its reversed $H$-projection on $\Pi_{K^*}$ (with density $dP/d\mu$ denoted by $p$). Inequality (15) is obvious if $H(Q|P_\theta) = \infty$. Consequently, only the parameters $\theta \in \Theta^c = \{\theta \in \Theta_{K^*} : H(Q|P_\theta) < \infty\}$ have to be considered.

Now, it is readily seen that the set $\Theta^c$ is convex. Moreover, it is an open set. Indeed, $Q\log q$ and $Q\ell_\theta$ are finite [because $Q \in \mathcal{L}'_\tau(P^*)$ and $\ell_\theta \in \mathcal{L}_\tau(P^*)$], so the decomposition $H(Q|P_\theta) = Q\log q - Q\ell_\theta$ is valid. Assumption (iv) guarantees then that $\Theta^c$ is open.

Besides, because $Q\log q$ and $H(Q|P)$ are finite, (15) is equivalent to $(Q - P)\log(p/p_\theta) \ge 0$ (any $\theta \in \Theta^c$). Denoting $P$ by $P_{\bar{\theta}}$ finally implies that (15) is equivalent to

$$(\bar{\theta} - \theta)^T (Q - P)t \ge 0 \quad (\text{all } \theta \in \Theta^c). \tag{C.2}$$

This concludes that part of the proof.

Now, the decomposition $H(Q|P_\theta) = Q\log q - Q\ell_\theta$ and $H(Q|P) = H(Q|\Pi_{K^*})$ also imply that

$$0 \le Q\log\frac{p}{p_\theta} < \infty \quad (\text{all } \theta \in \Theta^c). \tag{C.3}$$

Let us define $f$ on $\Theta^c$ by $f(\theta) = Q\log(p/p_\theta) = (\bar{\theta} - \theta)^T Qt + \phi(\theta) - \phi(\bar{\theta})$. Then the convexity of $\phi$ and (C.3) imply that $f$ is a proper convex function on $\Theta^c$. Furthermore, $f$ is differentiable at $\bar{\theta}$ with gradient $\dot{f}(\bar{\theta}) = (P - Q)t$. Since $f$ achieves its minimum at $\bar{\theta}$ by virtue of (C.3), Theorem 27.4 of [39] applies; hence,

$$(\theta - \bar{\theta})^T(-\dot{f}(\bar{\theta})) = (\theta - \bar{\theta})^T (Q - P)t \le 0 \quad (\text{all } \theta \in \Theta^c).$$

This is exactly (C.2), so the proof is complete. □
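A closed-form toy case may help to read the Pythagorean-type inequality, taken here to be $H(Q|P_\theta) \ge H(Q|P) + H(P|P_\theta)$, together with condition (C.2). The example below is not from the paper: it assumes the unit-variance Gaussian location family, for which the relative entropy between $N(a,1)$ and $N(b,1)$ equals $(a-b)^2/2$, a convex parameter set $[-1, 1]$, and $Q = N(2, 1)$.

```python
def kl_gauss(a: float, b: float) -> float:
    # Relative entropy H(N(a,1) | N(b,1)) = (a - b)^2 / 2.
    return 0.5 * (a - b) ** 2


def reversed_projection(q_mean: float, lo: float, hi: float) -> float:
    # Minimizer of theta -> H(N(q_mean,1) | N(theta,1)) over [lo, hi]:
    # the objective is quadratic in theta, so the minimizer is a clipping.
    return min(max(q_mean, lo), hi)


def pythagorean_gap(q_mean: float, theta_bar: float, theta: float) -> float:
    # H(Q|P_theta) - H(Q|P_bar) - H(P_bar|P_theta); nonnegative if the inequality holds.
    return kl_gauss(q_mean, theta) - kl_gauss(q_mean, theta_bar) - kl_gauss(theta_bar, theta)


def c2_left_hand_side(q_mean: float, theta_bar: float, theta: float) -> float:
    # (theta_bar - theta) * (Q t - P t) with t(z) = z, i.e. the quantity in (C.2).
    return (theta_bar - theta) * (q_mean - theta_bar)


if __name__ == "__main__":
    q_mean, lo, hi = 2.0, -1.0, 1.0  # hypothetical Q = N(2,1), Theta = [-1, 1]
    theta_bar = reversed_projection(q_mean, lo, hi)
    for theta in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(theta, pythagorean_gap(q_mean, theta_bar, theta),
              c2_left_hand_side(q_mean, theta_bar, theta))
```

In this toy case the gap equals the left-hand side of (C.2) exactly, which mirrors the equivalence established in the proof.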

APPENDIX D: PROOFS OF EFFICIENCY: OVERESTIMATION 
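Let us denote $\Delta_K = D(K+1) - D(K) > 0$ and set $K = K^*+1$. Then

$$P^*\{\widehat{K}_n^L > K^*\} \le P^*\Big\{\sup_{\theta\in\Theta_{K^*+1}} \mathbb{P}_n\ell_\theta - \sup_{\theta\in\Theta_{K^*}} \mathbb{P}_n\ell_\theta \ge n^{-1}v_n\Delta_{K^*}\Big\} \tag{D.1}$$

$$\le P^*\Big\{\sup_{g\in\mathcal{G}_K} |(\mathbb{P}_n - P^*)g| \ge n^{-1}v_n\Delta_{K^*}\Big\}, \tag{D.2}$$

by virtue of (A.1). Also, the peeling device inequality (A.2) of the same proposition implies that the expression given by (D.1) can be bounded by

$$P^*\Big\{\Big(\sup_{g} |(\mathbb{P}_n - P^*)g|\Big)^2 \ge n^{-1}v_n\Delta_{K^*}\Big\}. \tag{D.3}$$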

In the rest of this paper, we shall focus on (16) in Theorem 10 [on the 
basis of the overestimation probability upper bound (D.2)]. The proof of 
(18) in Theorem 11 [on the basis of the overestimation probability upper 
bound (D.3)] is similar and is omitted. 
Let us define $A^\infty = \{b \in \ell^\infty(\mathcal{G}_K) : \|b\|_{\mathcal{G}_K} \ge \Delta_{K^*}\}$. It is closed for the uniform topology on $\ell^\infty(\mathcal{G}_K)$. Since the assumptions of Theorem 2 are satisfied,

$$\limsup_{n\to\infty} n v_n^{-2}\log P^*\{\widehat{K}_n^L > K^*\} \le \limsup_{n\to\infty} n v_n^{-2}\log P^*\{(n v_n^{-1})(\mathbb{P}_n - P^*)^\infty \in A^\infty\} \le -\inf\{J(b) : b \in A^\infty\}.$$

Let us prove that the right-hand side term above is negative.

Suppose indeed, on the contrary, that the infimum is zero: this implies $0 \in A^\infty$, which is obviously not true. If the infimum were zero, then there would exist a sequence $\{b_p\}$ of elements of $\ell^\infty(\mathcal{G}_K)$ such that $b_p \in A^\infty$ and $J(b_p) \le 1/p$. Consequently, there would exist a sequence $\{Q_p\}$ of elements of $M(\mathcal{Z})$ such that, for every $p \ge 1$, $Q_p \ll P^*$ (with derivative $dQ_p/dP^*$ denoted by $q_p$) and both $P^*q_p^2/2 \le J(b_p) + 1/p \le 2/p$ and $Q_p^\infty = b_p$. Thus, for any $g \in \mathcal{G}_K$,

$$(b_p g)^2 = (P^* q_p g)^2 \le (P^* q_p^2)(P^* g^2) \le (4/p)\sup_{g\in\mathcal{G}_K} P^* g^2$$

by virtue of the Cauchy-Schwarz inequality. Now, $\mathcal{G}_K$ is $P^*$-Donsker; hence, it is totally bounded in $L^2(P^*)$, and the above display implies that $\|b_p\|_{\mathcal{G}_K} = o(1)$. Consequently, $0 \in A^\infty$ as a limit of a sequence of elements of the closed set $A^\infty$.

This completes the proof of (16) of Theorem 10. 

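The proof of (17) in Theorem 10 [which parallels the proof of (19) in Theorem 11] is very similar. Once again, the union bound and Lemma 1.2.15 of [17] imply that

$$\limsup_{n\to\infty} n v_n^{-2}\log P^*\{\widehat{K}_n^G > K^*\} = \sup_K \limsup_{n\to\infty} n v_n^{-2}\log P^*\{\widehat{K}_n^G = K\}$$

$$\le \sup_K \limsup_{n\to\infty} n v_n^{-2}\log P^*\Big\{\sup_{\theta\in\Theta_K} \mathbb{P}_n\ell_\theta - \sup_{\theta\in\Theta_{K^*}} \mathbb{P}_n\ell_\theta \ge n^{-1}v_n\Delta_K\Big\}$$

($\sup_K$ stands for $\sup_{K^* < K \le K_{\max}}$). This bound is handled as the bound (D.1) above, hence the final result. □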

APPENDIX E: PROOFS FOR THE BENCHMARK EXAMPLES 

E.1. Proof of Lemma 6. Lemma E.1 allows us to focus on the restrictions of the $\ell_\theta$'s to a well-chosen compact set of $\mathcal{Z}$.

Lemma E.1. Let $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing nonnegative function such that $\psi(x)/x \to \infty$ as $x \to \infty$. Let us assume that $(u - l)$ is continuous on $\mathcal{Z} \subset \mathbb{R}^q$, $(u - l)(z) \to \infty$ as $|z| \to \infty$, and that $\psi(u - l) \in \mathcal{L}_\tau(P^*)$. Then $(u - l) \in \mathcal{M}_\tau(P^*)$ and, for all $\varepsilon > 0$ and $Q \in \mathcal{Q}$, there exists a compact subset $C$ of $\mathcal{Z}$ such that $Q(u - l)\mathbf{1}\{C^c\} \le \varepsilon$.

Proof. It is easily verified that $(u - l) \in \mathcal{M}_\tau(P^*)$. Besides, the set $\{z \in \mathcal{Z} : (u - l)(z) \le M\}$ is compact for any $M > 0$ and $Q(u - l)\mathbf{1}\{(u - l) > M\} \le \sup_{x > M}[x/\psi(x)]\, Q\psi(u - l) \to 0$ as $M \to \infty$. □

Of course, the assumptions of Lemma E.1 are satisfied in the LM example [here, $\psi(x) = x^{1+c}$ (any $x \ge 0$)]. Thus, let us set $K \ge 1$, $Q \in \mathcal{Q}$ and $\varepsilon > 0$. There exists a compact set $C$ of $\mathcal{Z}$ such that, for every $\theta, t \in \Theta_K$,

$$|Q(\ell_\theta - \ell_t)| \le Q|\ell_\theta - \ell_t|\mathbf{1}\{C\} + Q(u - l)\mathbf{1}\{C^c\} \le Q|\ell_\theta - \ell_t|\mathbf{1}\{C\} + \varepsilon. \tag{E.1}$$

Now, Ascoli's theorem ensures that $\{\ell_\theta\mathbf{1}\{C\} : \theta \in \Theta_K\}$ is precompact in the set of the continuous functions on $C$ equipped with the uniform norm. Consequently, there exists a finite subset $T$ of $\Theta_K$ such that, for every $\theta \in \Theta_K$, there exists $t \in T$ such that $\sup_{z\in C} |\ell_\theta(z) - \ell_t(z)| \le \varepsilon$. Straightforwardly, for any $\theta \in \Theta_K$, there exists $t \in T$ such that the left-hand side term of (E.1) is bounded by $2\varepsilon$. This completes the proof. □

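E.2. Proof of Lemma 7. Let us suppose, on the contrary, that

$$H(P^*|\Pi_K) \le H(P^*|\Pi_{K+1}), \tag{E.2}$$

that is, that equality holds. Lower semicontinuity of $H(P^*|\cdot)$ and compactness of $\Pi_K$ ensure the existence of $P_0 = P_{\theta_0} \in \Pi_K$ such that $H(P^*|P_0) = H(P^*|\Pi_K)$. Let us denote $f_0(x) = f_{\theta_0}(x) = \sum_{k=1}^K m_k\mathbf{1}\{x \in \tau_k\}$ (all $x \in \mathcal{X}$). Now, equality (20) and $\|f^* - f_0\|_2^2 = \sum_{k=1}^K P(f^* - m_k)^2\mathbf{1}\{\tau_k\}$ imply that $m_k = Pf^*\mathbf{1}\{\tau_k\}/P(\tau_k)$ for $k = 1, \ldots, K$. Let us prove that $f^* = f_0$; hence, $P^* \in \Pi_K$.

Indeed, (E.2) ensures that, for any $1 \le k_0 \le K$, for any subset $S$ of $\tau_{k_0}$ with positive $P$-measure,

$$Pf^{*2} - \sum_{k=1}^K \frac{(Pf^*\mathbf{1}\{\tau_k\})^2}{P(\tau_k)} \le Pf^{*2} - \sum_{k\ne k_0} \frac{(Pf^*\mathbf{1}\{\tau_k\})^2}{P(\tau_k)} - \frac{(Pf^*\mathbf{1}\{S\})^2}{P(S)} - \frac{(Pf^*\mathbf{1}\{\tau_{k_0}\setminus S\})^2}{P(\tau_{k_0}\setminus S)}$$

or, equivalently,

$$\frac{(Pf^*\mathbf{1}\{S\})^2}{P(S)} + \frac{(Pf^*\mathbf{1}\{\tau_{k_0}\setminus S\})^2}{P(\tau_{k_0}\setminus S)} \le \frac{(Pf^*\mathbf{1}\{\tau_{k_0}\})^2}{P(\tau_{k_0})}.$$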
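Thus, first expansion of the right-hand side term and then factorization yield

$$\frac{P(\tau_{k_0}\setminus S)}{P(S)}(Pf^*\mathbf{1}\{S\})^2 + \frac{P(S)}{P(\tau_{k_0}\setminus S)}(Pf^*\mathbf{1}\{\tau_{k_0}\setminus S\})^2 \le 2(Pf^*\mathbf{1}\{S\})(Pf^*\mathbf{1}\{\tau_{k_0}\setminus S\}). \tag{E.3}$$

Now, the basic inequality $2ab \le (au)^2 + (bu^{-1})^2$ (all $a, b \in \mathbb{R}$ and positive $u$) together with (E.3) ensure [take $u^2 = P(\tau_{k_0}\setminus S)/P(S)$] that equality holds in (E.3). Consequently, for any subset $S$ of $\tau_{k_0}$ with positive $P$-measure,

$$\frac{Pf^*\mathbf{1}\{S\}}{P(S)} = \frac{Pf^*\mathbf{1}\{\tau_{k_0}\setminus S\}}{P(\tau_{k_0}\setminus S)} = \frac{Pf^*\mathbf{1}\{\tau_{k_0}\} - Pf^*\mathbf{1}\{S\}}{P(\tau_{k_0}\setminus S)};$$

hence, for any subset $S$ of $\tau_{k_0}$,

$$Pf^*\mathbf{1}\{S\} = \frac{P(S)}{P(\tau_{k_0})}Pf^*\mathbf{1}\{\tau_{k_0}\}.$$

The choice $S = S_+ = \{x \in \tau_{k_0} : f^*(x) > Pf^*\mathbf{1}\{\tau_{k_0}\}/P(\tau_{k_0})\}$ yields $P(S_+) = 0$. The choice $S = S_- = \{x \in \tau_{k_0} : f^*(x) < Pf^*\mathbf{1}\{\tau_{k_0}\}/P(\tau_{k_0})\}$ yields, in turn, $P(S_-) = 0$; hence, finally, $P(S_0) = P(\tau_{k_0})$, where $S_0 = \{x \in \tau_{k_0} : f^*(x) = Pf^*\mathbf{1}\{\tau_{k_0}\}/P(\tau_{k_0})\}$ (i.e., $f^*$ is $P$-a.s. constant on $\tau_{k_0}$). This concludes the proof because $k_0$ is arbitrary. □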
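The mechanism of this proof can be seen on a finite toy example: splitting a cell of the partition never decreases the quantity $\sum_k (Pf^*\mathbf{1}\{\tau_k\})^2/P(\tau_k)$, and leaves it unchanged only when $f^*$ is constant on the cell being split. The sketch below assumes a hypothetical design distribution $P$ supported on four points with equal weights and a hypothetical regression function; none of these numbers come from the paper.

```python
def explained(weights, values, cells):
    # Sum over cells of (P f 1{cell})^2 / P(cell) for a discrete distribution P.
    total = 0.0
    for cell in cells:
        mass = sum(weights[i] for i in cell)
        integral = sum(weights[i] * values[i] for i in cell)
        total += integral ** 2 / mass
    return total


def splitting_gain(weights, values, cells, k0, subset):
    # Gain obtained when cell k0 is split into `subset` and its complement.
    rest = [i for i in cells[k0] if i not in subset]
    refined = cells[:k0] + [list(subset), rest] + cells[k0 + 1:]
    return explained(weights, values, refined) - explained(weights, values, cells)


if __name__ == "__main__":
    weights = [0.25, 0.25, 0.25, 0.25]   # hypothetical P on four design points
    values = [1.0, 1.0, 3.0, 5.0]        # hypothetical f* at those points
    cells = [[0, 1], [2, 3]]             # partition with K = 2 cells
    print(splitting_gain(weights, values, cells, 0, [0]))  # f* constant on the cell
    print(splitting_gain(weights, values, cells, 1, [2]))  # f* not constant on the cell
```

A strictly positive gain corresponds to a strict decrease of the residual when moving from $K$ to $K+1$ cells, which is what the contradiction argument exploits.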

E.3. Proof of Lemma 8. Let us set $K \ge 1$, $Q \in \mathcal{Q} \cap \mathcal{L}'_\tau(P^*)$ (with decomposition $Q = Q^a + Q^s$ according to Lemma 1) and $\varepsilon > 0$. Because $Q^a \ll P^*$, there exists $\delta > 0$ such that, for any measurable $B$, $P^*(B) \le \delta$ yields $Q^a(B) \le \varepsilon$.

Now, it was emphasized in Section 5.2 that $(u - l)^{1+c} \in \mathcal{L}_\tau(P^*)$ for some $c > 0$; hence, Lemma E.1 applies with $\psi(x) = x^{1+c}$ (all $x \ge 0$). So, there exists a compact set $C$ of $\mathcal{Z}$ such that, for every $\theta, t \in \Theta_K$,

$$|Q(\ell_\theta - \ell_t)| \le Q|\ell_\theta - \ell_t|\mathbf{1}\{C\} + Q(u - l)\mathbf{1}\{C^c\} \le Q|\ell_\theta - \ell_t|\mathbf{1}\{C\} + \varepsilon = Q^a|\ell_\theta - \ell_t|\mathbf{1}\{C\} + \varepsilon \le MQ^a|f_\theta - f_t| + \varepsilon, \tag{E.4}$$

where the equality holds because $(\ell_\theta - \ell_t)\mathbf{1}\{C\}$ is bounded and $M$ is a constant which depends only on $l, u$ (via $C$) and $\mathcal{M}$.

Furthermore, the Borel-Lebesgue property of compact sets guarantees that there exists a finite subset $T$ of $\Theta_K$ such that the union over $t \in T$ of the balls of center $t$ and radius $\delta$ covers $\Theta_K$. Let us set $t \in T$ [$t = (\tau^0, m^0)$] and $\theta \in \Theta_K$ [$\theta = (\tau^1, m^1)$] with $d(t, \theta) \le \delta$. It can be assumed without loss of generality that $P(\tau_j^0 \Delta \tau_j^1) \le \delta$ and $|m_j^0 - m_j^1| \le \delta$ for all $j = 1, \ldots, K$. Consequently, with notation $M' = \sup\{|m| : m \in \mathcal{M}\}$, for any $x \in \mathcal{X}$,

$$|f_\theta(x) - f_t(x)| \le \sum_{j=1}^K |m_j^1 - m_j^0|\mathbf{1}\{x \in \tau_j^0 \cap \tau_j^1\} + M'(K - 1)\sum_{j=1}^K \mathbf{1}\{x \in \tau_j^0 \Delta \tau_j^1\} \le K\delta + M'(K - 1)\sum_{j=1}^K \mathbf{1}\{x \in \tau_j^0 \Delta \tau_j^1\};$$

hence,

$$Q^a|f_\theta - f_t| \le K\delta + M'(K - 1)\sum_{j=1}^K Q^a(\tau_j^0 \Delta \tau_j^1 \times \mathbb{R}).$$

Besides, $P^*(\tau_j^0 \Delta \tau_j^1 \times \mathbb{R}) = P(\tau_j^0 \Delta \tau_j^1) \le \delta$ finally yields $Q^a(\tau_j^0 \Delta \tau_j^1 \times \mathbb{R}) \le \varepsilon$. By invoking (E.4), $|Q(\ell_\theta - \ell_t)| \le M''\varepsilon$, for a constant $M''$ depending only on $K$, $l, u$ (via $C$) and $\mathcal{M}$. This completes the proof. □